Abstract

The Lipschitz multi-armed bandit (MAB) problem generalizes the classical multi-armed bandit problem
by assuming one is given side information consisting of a priori upper bounds on the difference in
expected payoff between certain pairs of strategies. Classical results of Lai-Robbins and Auer et
al. [4] imply a logarithmic regret bound for the Lipschitz MAB problem on finite metric spaces. Recent
results on continuum-armed bandit problems and their generalizations imply lower bounds of $\sqrt{t}$, or
stronger, for many infinite metric spaces such as the unit interval. Is this dichotomy universal? We prove
that the answer is yes: for every metric space, the optimal regret of a Lipschitz MAB algorithm is either
bounded above by any $f \in \omega(\log t)$, or bounded below by any $g \in o(\sqrt{t})$. Perhaps surprisingly, this
dichotomy does not coincide with the distinction between finite and infinite metric spaces; instead it 
depends on whether the completion of the metric space is compact and countable. Our proof connects 
upper and lower bound techniques in online learning with classical topological notions such as perfect 
sets and the Cantor-Bendixson theorem. 

We also consider the full-feedback (a.k.a., best-expert) version of the Lipschitz MAB problem, termed
the Lipschitz experts problem, and show that this problem exhibits a similar dichotomy. We proceed to
give nearly matching upper and lower bounds on regret in the Lipschitz experts problem on uncountable
metric spaces. These bounds are of the form $\tilde{O}(t^{\gamma})$, where the exponent $\gamma \in [\frac{1}{2}, 1]$ depends on the
metric space. To characterize this dependence, we introduce a novel dimensionality notion tailored to 
the experts problem. Finally, we show that both Lipschitz bandits and Lipschitz experts problems become 
completely intractable (in the sense that no algorithm has regret o(t)) if and only if the completion of the 
metric space is non-compact. 
1 Introduction



Multi-armed bandit (henceforth, MAB) problems have been studied for more than fifty years as a clean 
abstract setting for analyzing the exploration-exploitation tradeoffs that are common in sequential decision 
making. In the stochastic MAB problem, an algorithm must repeatedly choose from a fixed set of strategies 
(a.k.a. "arms"), each time receiving a random payoff whose distribution depends on the strategy selected.¹
The performance of MAB algorithms is commonly evaluated in terms of regret: the difference in expected 
payoff between the algorithm's choices and always playing one fixed strategy. In addition to their many 
applications — which range from experimental design to online auctions and web advertising — another 
appealing feature of multi-armed bandit algorithms is that they are surprisingly efficient in terms of the 
growth rate of regret: for finite-armed bandit problems, algorithms whose regret at time $t$ scales as $O(\log t)$
have been known for more than two decades, beginning with the seminal work of Lai and Robbins and
extended in subsequent work such as [4].

Many of the applications of MAB problems — especially the computer science applications such as 
online auctions, web advertising, or adaptive routing — require considering strategy sets which are very 
large or even infinite. For infinite strategy sets the $O(\log t)$ bound does not apply, while for very large
finite sets the $O(\cdot)$ notation masks a prohibitively large constant. Indeed, without making any assumptions
about the strategies and their payoffs, bandit problems with large strategy sets allow for no non-trivial 
solutions — any MAB algorithm performs as badly, on some inputs, as random guessing. This motivates 
the study of bandit problems in which the strategy set is large but one is given side information constraining 
the form of the payoffs. Such problems have become the subject of quite intensive study in recent years.

The Lipschitz MAB problem is a version of the stochastic MAB problem in which the side information
consists of a priori upper bounds on the difference in expected payoff between certain pairs of strategies. 
This models situations where the decision maker has access to some similarity information about strategies 
which ensures that similar strategies obtain similar payoffs. Abstractly, the similarity information may be 
modeled as defining a metric space structure on the strategy set, and the side constraints imply that the 
expected payoff function $\mu$ is a Lipschitz function (with Lipschitz constant 1) on this metric space.

The Lipschitz MAB problem was introduced by Kleinberg et al. [30]. Preceding work [2, 6, 13, 27, 35]
has studied the problem in a few specific metric spaces such as a one-dimensional real interval. The prior
work considered regret $R(t)$ as a function of time $t$, and focused on the asymptotic dependence of $R(t)$
on, loosely speaking, the dimensionality of the metric space. Various upper and lower bounds of the form
$R(t) = \tilde{O}(t^{\gamma})$ were proved, where the exponent $\gamma < 1$ depends on the metric space. In particular, if the
metric space is the interval $[0, 1]$ with the standard metric $d(x, y) = |x - y|$, then there exists an algorithm
with regret $R(t) = \tilde{O}(t^{2/3})$, and this bound is tight up to polylog factors [27]. More generally, for an
arbitrary infinite metric space $(X, d)$ one can define an isometry invariant $\gamma = \gamma(X, d) \in [\frac{1}{2}, 1]$ such that
there exists an algorithm with regret $R(t) = \tilde{O}(t^{\gamma})$, which is tight up to polylog factors if $\gamma > \frac{1}{2}$; see [30].

The following picture emerges. Although algorithms with regret $R(t) = \tilde{O}(t^{\gamma})$, $\gamma < 1$, are known for
most metric spaces, existing work unfortunately provides no examples of infinite metric spaces admitting
bandit algorithms satisfying the Lai-Robbins regret bound $R(t) = O(\log t)$, although this bound holds for
all finite metrics. In fact, for most metric spaces that have been studied (such as the unit interval) this
possibility is excluded by known lower bounds of the form $R(t) = \Omega(t^{\gamma})$, where $\gamma \ge \frac{1}{2}$. Therefore it is
natural to ask,



1 More precisely, the payoff of each arm is an independent sample from a fixed time-invariant distribution with bounded support.
2 Megiddo and Hazan [25] consider a somewhat related (but technically very different) setting which combines full feedback, 
contextual "hints", convex payoffs, and (essentially) a similarity metric space on the contexts. 
• Is $O(\sqrt{t})$ regret the best possible for an infinite metric space? Alternatively, are there infinite metric
spaces for which one can achieve regret $O(\log t)$? Is there any metric space for which the best possible
regret is between $O(\log t)$ and $O(\sqrt{t})$?

Our contributions. To make the above issue more concrete, let us put forward the following definition. 

Definition 1.1. Consider the Lipschitz MAB problem on a fixed metric space. A bandit algorithm is
$f(t)$-tractable if for any problem instance $I$ the algorithm's regret is $R(t) = O_I(f(t))$.³ The problem is
$f(t)$-tractable if such an algorithm exists.

We settle the questions listed above by proving the following dichotomy. 

Theorem 1.2. Consider the Lipschitz MAB problem on a fixed metric space $(X, d)$. Then the following
dichotomy holds: either the problem is $f(t)$-tractable for every $f \in \omega(\log t)$, or it is not $g(t)$-tractable for
any $g \in o(\sqrt{t})$. In fact, the former occurs if and only if the completion of $X$ is a compact metric space with
countably many points.
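
To make the criterion in Theorem 1.2 concrete, the following block lists three standard examples and the side of the dichotomy each falls on. The spaces $X_1, X_2, X_3$ are illustrative choices that do not appear in the text above; each carries the metric $d(x, y) = |x - y|$.

```latex
% Illustrative examples only; all three spaces use d(x,y) = |x - y|.
\begin{align*}
X_1 &= \{1/n : n \ge 1\}: && \text{completion } X_1 \cup \{0\} \text{ is compact and countable}
      && \Rightarrow\ f(t)\text{-tractable for every } f \in \omega(\log t),\\
X_2 &= [0,1]: && \text{completion } [0,1] \text{ is compact but uncountable}
      && \Rightarrow\ \text{not } g(t)\text{-tractable for any } g \in o(\sqrt{t}),\\
X_3 &= \mathbb{N}: && \text{completion } \mathbb{N} \text{ is countable but not compact}
      && \Rightarrow\ \text{not } g(t)\text{-tractable for any } g \in o(\sqrt{t}).
\end{align*}
```

In particular, $X_1$ is an infinite metric space on the "logarithmic" side of the dichotomy, which shows that the dividing line is not finiteness of the strategy set.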

It is worth mentioning that the regret bound $R(t) = O_I(\log t)$ is the best possible, even for two-armed
bandit problems, by a lower bound of Lai and Robbins. Thus our upper bound for Lipschitz MAB
problems in compact, countable metric spaces is nearly the best possible bound for such spaces, modulo
the gap between "$f(t) = \log t$" and "$\forall f \in \omega(\log t)$". Furthermore, we show that this gap is inevitable for
infinite metric spaces:

Theorem 1.3. For every infinite metric space $(X, d)$, the Lipschitz MAB problem on $(X, d)$ is not
$(\log t)$-tractable.

We turn our attention to the full-feedback version of the Lipschitz MAB problem. For any MAB problem
there exists a corresponding full-feedback problem in which after each round, the payoffs from all strategies
are revealed.⁴ Such settings have been extensively studied in the online learning literature under the name
best experts problems [11, 12, 37]. In particular, for a finite set of strategies one can achieve a constant
regret [29] when payoffs are i.i.d. over time.

In addition to the full feedback, one could also consider a version in which the payoffs are revealed for 
some but not all strategies. Specifically, we define the double feedback, where in each round the algorithm 
selects two strategies: the "bet" for which it receives the payoff, and the "free peek". After the round, the 
payoffs are revealed for both strategies. By abuse of notation, we will treat the bandit setting as a special 
case of the experts setting. 

The experts version of the Lipschitz MAB problem, called the Lipschitz experts problem, is defined in 
the obvious way: a problem instance is specified by a triple $(X, d, P)$, where $(X, d)$ is a metric space and $P$
is a Borel probability measure on the set $[0, 1]^X$ of payoff functions on $X$ (with the Borel $\sigma$-algebra induced
by the product topology on $[0, 1]^X$) such that the expected payoff function $x \mapsto \mathbb{E}_{f \in P}[f(x)]$ is a Lipschitz
function on $(X, d)$. In each round an algorithm is presented with an i.i.d. sample from $P$. The metric
structure of $(X, d)$ is known to the algorithm, the measure $P$ is not. We show that the Lipschitz experts
problem exhibits a dichotomy similar to the one in Theorem 1.2. We formulate the upper bound for the
double feedback, and the lower bound for the full feedback, thus avoiding the issue of what it means for an 
algorithm to receive feedback for infinitely many strategies. 

Theorem 1.4. The Lipschitz experts problem on a fixed metric space $(X, d)$ is either 1-tractable, even with
double feedback, or it is not $g(t)$-tractable for any $g \in o(\sqrt{t})$, even with full feedback. The former occurs if
and only if the completion of $X$ is a compact metric space with countably many points.

3 The notation $O_I(\cdot)$ means that the constant in $O(\cdot)$ can depend on $I$.
4 Formally, an algorithm can query an arbitrary finite number of strategies. 



Theorems 1.2 and 1.4 assert a dichotomy between metric spaces on which the Lipschitz MAB/experts
problem is very tractable, and those on which it is somewhat tractable. Let us consider the opposite end 
of the "tractability spectrum" and ask for which metric spaces the problem becomes completely intractable. 
We obtain a precise characterization: the problem is completely intractable if and only if the metric space is 
not pre-compact. Moreover, our upper bound is for the bandit setting, whereas the lower bound is for full 
feedback. 

Theorem 1.5. The Lipschitz experts problem on a fixed metric space $(X, d)$ is either $f(t)$-tractable for some
$f \in o(t)$, even in the bandit setting, or it is not $g(t)$-tractable for any $g \in o(t)$, even with full feedback. The
former occurs if and only if the completion of $X$ is a compact metric space.

Consider the full-feedback Lipschitz experts problem. In view of the $\sqrt{t}$ lower bound from Theorem 1.4,
we are interested in matching upper bounds. Gupta et al. [23] observed that such bounds hold for
every metric space $(X, d)$ of finite covering dimension: namely, the Lipschitz experts problem on $(X, d)$ is
$\sqrt{t}$-tractable. (Their algorithm is a version of the "naive algorithm" from [27, 30].) Therefore it is natural
to ask whether there exist metric spaces for which the optimal regret in the Lipschitz experts problem is
between $\sqrt{t}$ and $t$. We settle this question by proving a characterization with nearly matching upper and
lower bounds in terms of a novel dimensionality notion tailored to the experts problem. 

Theorem 1.6. For any metric space $(X, d)$, there exists an isometry invariant $b = b(X, d)$ such that the
full-feedback Lipschitz experts problem on $(X, d)$ is $(t^{\gamma})$-tractable for any $\gamma >$ […] and not $(t^{\gamma})$-tractable
for any $\gamma <$ […]. Depending on the metric space, $b(X, d)$ can take any value on a dense subset of $[0, \infty)$.

The lower bound in Theorem 1.6 holds for a restricted version of the full-feedback Lipschitz experts problem
in which a problem instance $(X, d, P)$ satisfies a further property that each function $f \in \mathrm{support}(P)$ is
itself a Lipschitz function on (X, d). We term this version the uniformly Lipschitz experts problem (with full 
feedback). In fact, for this version we obtain a matching upper bound. 

Theorem 1.7. Consider the uniformly Lipschitz experts problem with full feedback. Fix an uncountable
metric space $(X, d)$. Let $b = b(X, d)$ be the isometry invariant from Theorem 1.6. Then the problem on $(X, d)$
is $(t^{\gamma})$-tractable for any $\gamma > \max([…], […])$, and not $(t^{\gamma})$-tractable for any $\gamma < \max([…], […])$.

Connection to point-set topology. The main technical contribution of this paper is an interplay of on- 
line learning and point-set topology, which requires novel algorithmic and lower-bounding techniques. In 
particular, the connection to topology is essential in the (joint) proof of the two main results (Theorem 1.2
and Theorem 1.4). There, we identify a simple topological property (well-orderability) which entails the
algorithmic result, and another topological property (perfectness) which entails the lower bound. 

Definition 1.8. Consider a topological space X. X is called perfect if it contains no isolated points. A 
topological well-ordering of $X$ is a well-ordering $(X, \prec)$ such that every initial segment thereof is an open
set. If such $\prec$ exists, $X$ is called well-orderable. A metric space $(X, d)$ is called well-orderable if and only
if its metric topology is well-orderable. 
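
As a small worked illustration of Definition 1.8, the block below exhibits a topological well-ordering on a convergent sequence together with its limit. The space $X = \{0\} \cup \{1/n : n \ge 1\}$ with $d(x, y) = |x - y|$ and the particular ordering are illustrative choices, not taken from the text; an initial segment is understood as $S(x) = \{y \in X : y \prec x\}$.

```latex
% Illustrative example: X = {0} \cup {1/n : n >= 1}, d(x,y) = |x - y|.
\begin{align*}
&\text{Ordering:} && 1 \prec \tfrac{1}{2} \prec \tfrac{1}{3} \prec \cdots \prec 0
   \qquad (\text{order type } \omega + 1),\\
&\text{Well-ordering:} && \text{a non-empty } A \subseteq X \text{ has least element }
   \max(A \setminus \{0\}) \text{ if } A \ne \{0\}, \text{ and } 0 \text{ otherwise},\\
&\text{Initial segments:} && S(1/n) = \{1, \tfrac{1}{2}, \ldots, \tfrac{1}{n-1}\}
   \text{ is a finite set of isolated points, hence open},\\
& && S(0) = X \setminus \{0\} \text{ is open because } \{0\} \text{ is closed}.
\end{align*}
```

The usual order on real numbers would not work here: under it the subset $X \setminus \{0\}$ has no least element, so it is not a well-ordering of $X$. By contrast, $[0, 1]$ is perfect, since it has no isolated points.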

Perfect spaces are a classical notion in point-set topology. Topological well-orderings are implicit in the 
work of Cantor [10], but the particular definition given here is new, to the best of our knowledge.

The proof of Theorems 1.2 and 1.4 (for compact metric spaces) consists of three parts: the algorithmic
result for a compact well-orderable metric space, the lower bound for a metric space with a perfect subspace, 
and the following lemma that ties together the two topological properties. 
Lemma 1.9. For any compact metric space $(X, d)$, the following are equivalent: (i) $X$ is a countable set,
(ii) $(X, d)$ is well-orderable, (iii) no subspace of $(X, d)$ is perfect.⁵



Lemma 1.9 follows from classical theorems of Cantor-Bendixson [10] and Mazurkiewicz-Sierpinski [32].
We provide a proof in Appendix C for the sake of making our exposition self-contained.

To reduce the Lipschitz MAB problem to complete metric spaces we show that the problem is
$f(t)$-tractable on a given metric space if and only if it is $f(t)$-tractable on the completion thereof. The same is true
for the double-feedback Lipschitz experts problem, and the "only if" direction holds for the full-feedback
Lipschitz experts problem. Then the main dichotomy results follow from the lower bound in Theorem [13] 

Accessing the metric space. We define a bandit algorithm as a (possibly randomized) Borel measurable 
function that maps a history of past observations $(x_j, r_j) \in X \times [0, 1]$ to a strategy $x \in X$ to be played in
the current period. An experts algorithm is similarly defined as a (possibly randomized) Borel measurable
function mapping the observation history to a strategy $x \in X$ to be played in the current period. (Or, in
the case of the double feedback model, a pair of strategies representing the "bet" and "free peek".) The
observation history is either a sequence of elements of $[0, 1]^X$ in the full feedback model, or a sequence of
quadruples $(x_j, r_j, x'_j, r'_j) \in (X \times [0, 1])^2$ in the double feedback model.

These definitions abstract away a potentially thorny issue of representing and accessing an infinite metric 
space. For our algorithmic results, we handle this issue as follows: the metric space is accessed via
well-defined calls to a suitable oracle. Moreover, the main algorithmic result in Theorems 1.2 and 1.4 requires
an oracle which represents the well-ordering. We also provide an extension in Section 6: an
$\omega(\log t)$-tractability result for a wide family of metric spaces - including, for example, compact metric spaces with a
finite number of limit points - for which a more intuitive oracle access suffices. These are the metric spaces
with a finite Cantor-Bendixson rank, a classic notion from point-set topology.

Related work and discussion. Algorithms for the stochastic MAB problem admit regret guarantees of the 
form $R(t) = O(f(t))$, which are of two types - instance-specific and instance-independent - depending on
whether the constant in $O(\cdot)$ is allowed to depend on the problem instance. For instance, UCB1 [4] admits an
instance-specific guarantee $R(t) = O(\log t)$, whereas the best-known instance-independent guarantee for
this algorithm is only $R(t) = O(\sqrt{kt \log t})$, where $k$ is the number of arms. Accordingly, a lower bound
for the instance-independent version has to show that for any algorithm and a given time t, there exists a 
problem instance whose regret is large at this time, whereas for the instance-specific version one needs a 
much more ambitious argument: for any algorithm there exists a problem instance whose regret is large 
infinitely often. In this paper, we focus on instance-specific guarantees. 
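
For readers who want to see the finite-armed baseline in action, here is a minimal sketch of the standard UCB1 index rule (empirical mean plus $\sqrt{2 \ln t / n}$) on Bernoulli arms. The function name `ucb1`, the Bernoulli payoff model, and the fixed seed are illustrative assumptions; the text above only cites UCB1 and does not specify an implementation.

```python
import math
import random


def ucb1(means, horizon, seed=0):
    """Run the standard UCB1 rule on Bernoulli arms (illustrative sketch).

    means   -- expected payoff of each arm (assumed Bernoulli here)
    horizon -- number of rounds t
    Returns (pull counts per arm, expected regret of the realized plays).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # number of times each arm was played
    sums = [0.0] * k      # total observed payoff of each arm
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # initialization: play every arm once
        else:
            # index = empirical mean + confidence radius
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return counts, regret


if __name__ == "__main__":
    counts, regret = ucb1([0.9, 0.1], horizon=2000)
    print(counts, regret)
```

The instance-specific flavor of the guarantee shows up in how the number of pulls of a suboptimal arm depends on its gap to the best arm, a quantity that is a property of the instance rather than of the algorithm.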

Apart from the stochastic MAB problem considered in this paper, several other MAB formulations have
been studied in the literature (see [12] for background). Early work [20, 19] has focused on Bayesian
formulations in which Bayesian priors on payoffs are known, and the goal is to maximize the payoff in expectation
over these priors. In these formulations, an MAB instance is a Markov Decision Process (MDP) in which
each arm is represented by a Markov Chain with rewards on states, and the transition happens whenever
the arm is played. In the more "difficult" restless bandits [38, 9] formulations, the state also changes
when the arm is passive, according to another transition matrix. In the theoretical computer science
literature, recent work in this vein includes [21, 22]. Interestingly, these Bayesian formulations have an offline
flavor: given the MDP, one needs to efficiently compute a (nearly) optimal mapping from states to actions.
Contrasting the Bayesian formulations in which the probabilistic model is fully specified, the adversarial
MAB problem [5, 3] makes no stochastic assumptions whatsoever. Instead, it makes a very pessimistic
assumption that payoffs are chosen by an adversary that has access to the algorithm's code but not to its
random seed. As in the stochastic MAB problem, the goal is to minimize regret. For any fixed (finite)
number of arms, the best possible regret in this setting is $R(t) = O(\sqrt{t})$ [5]. For infinite strategy sets, one
often considers the linear MAB problem in which strategies lie in a convex subset of $\mathbb{R}^d$, and in each round
the payoffs form a linear function [7, 15, 1, 23].

5 For arbitrary metric spaces we have (ii) $\Leftrightarrow$ (iii) and (i) $\Rightarrow$ (ii), but not (ii) $\Rightarrow$ (i).

It is an open question whether the ideas from the Lipschitz MAB problem extend to the above for- 
mulations. The adversarial version of the Lipschitz MAB problem is well-defined, but to the best of our 
knowledge, the only known result is the "naive" algorithm from [27]. One could define the stochastic ver-
sion of the linear MAB problem (in which the expected payoffs form a fixed time-invariant linear function), 
which can be viewed as a special case of the Lipschitz MAB problem. However, this view is not likely to 
be fruitful because in the Lipschitz MAB problem measuring a payoff of one arm is useless for estimating 
the payoffs of distant arms, whereas in prior work on the linear MAB problem inferences about distant arms 
are crucial. For Bayesian MAB problems with limited similarity information, it is not clear how to model 
this information, mainly because in the Bayesian setting similarity between arms is naturally represented 
via correlated priors rather than a metric space. 

Organization of the paper. Preliminaries are in Section 2. We present a joint proof for the two main
results (Theorems 1.2 and 1.4). The lower bound is proved in Section 3 and the algorithmic results are
in Section 4. Coupled with the topological equivalence (Lemma 1.9), this gives the proof for compact
metric spaces. A complementary $(\log t)$-intractability result for infinite metric spaces (Theorem 1.3) is in
Section 5. The $\omega(\log t)$-tractability result via simpler oracle access (for metric spaces of finite
Cantor-Bendixson rank) is in Section 6. The boundary-of-tractability result (Theorem 1.5) is in Section 7. The
full-feedback Lipschitz experts problem in a (very) high dimension (including Theorems 1.6 and 1.7) is
discussed in Sections 8 and 9.

Some of the proofs are moved to appendices. In Appendix A we reduce the problem to that on complete
metric spaces. All KL-divergence arguments (which underlie our lower bounds) are gathered in Appendix B.
We provide a self-contained proof of the topological lemma (Lemma 1.9) in Appendix C.

2 Preliminaries

This section contains various definitions which make the paper essentially self-contained (the only exception 
being ordinal numbers which are used in Section 9.2). In particular, the paper uses notions from General
Topology which are typically covered in any introductory text or course on the subject. 

Lipschitz MAB problem. Consider the Lipschitz MAB problem on a metric space $(X, d)$ with payoff
function $\mu$. The payoff from each arm $x \in X$ is an independent sample from a fixed (time-invariant)
distribution with support in $[0, 1]$ and expectation $\mu(x)$ such that $|\mu(x) - \mu(y)| \le d(x, y)$ for all $x, y \in X$.
For $S \subseteq X$ denote $\sup(\mu, S) = \sup_{x \in S} \mu(x)$ and similarly $\mathrm{argmax}(\mu, S) = \mathrm{argmax}_{x \in S}\, \mu(x)$. Given a
bandit algorithm $A$, let $P_{(A,\mu)}(t)$ be the expected reward collected by the algorithm in the first $t$ rounds
on the problem instance $(X, d, \mu)$. The regret of algorithm $A$ in $t$ rounds is $R_{(A,\mu)}(t) = \sup(\mu, X)\, t - P_{(A,\mu)}(t)$.
Given a Lipschitz experts algorithm $A$ and a problem instance $(X, d, P)$, the notations $P_{(A,P)}(t)$
and $R_{(A,P)}(t)$, denoting expected reward and regret, are defined analogously.
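
The two ingredients of this definition, the Lipschitz condition on $\mu$ and the regret of a sequence of plays, can be sketched on a finite sample of points as follows. The helper names `is_lipschitz` and `regret`, the finite point set, and the example payoff function are all illustrative assumptions; on a finite set the supremum is simply a maximum, and the expected reward of a fixed play sequence is the sum of $\mu$ over the plays.

```python
def is_lipschitz(mu, points, dist, tol=1e-12):
    """Check |mu(x) - mu(y)| <= dist(x, y) for all pairs of sample points."""
    return all(
        abs(mu(x) - mu(y)) <= dist(x, y) + tol
        for x in points
        for y in points
    )


def regret(mu, points, plays):
    """Regret of a fixed play sequence: sup(mu, X) * t minus expected reward.

    `points` is a finite stand-in for the strategy set X, so the supremum
    is a maximum; `plays` is the list of strategies chosen in rounds 1..t.
    """
    best = max(mu(x) for x in points)
    return best * len(plays) - sum(mu(x) for x in plays)


if __name__ == "__main__":
    points = [0.0, 0.25, 0.5, 0.75, 1.0]
    dist = lambda x, y: abs(x - y)
    mu = lambda x: 0.5 + 0.5 * x          # slope 1/2, so 1-Lipschitz
    print(is_lipschitz(mu, points, dist))  # Lipschitz check on the sample
    print(regret(mu, points, [1.0, 1.0, 0.0]))
```

The check only certifies the Lipschitz condition on the sampled points; it says nothing about points of $X$ outside the sample.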

Metric topology. Let $(X, d)$ be a metric space. An open ball in $(X, d)$ is denoted $B(x_0, r) = \{x \in X :
d(x, x_0) < r\}$, where $x_0 \in X$ is the center, and $r > 0$ is the radius. A Cauchy sequence in $(X, d)$ is
a sequence such that for every $\delta > 0$, there is an open ball of radius $\delta$ containing all but finitely many
points of the sequence. We say $X$ is complete if every Cauchy sequence has a limit point in $X$. For two
Cauchy sequences $x = x_1, x_2, \ldots$ and $y = y_1, y_2, \ldots$ the distance $d(x, y) = \lim_{i \to \infty} d(x_i, y_i)$ is
well-defined. Two Cauchy sequences are declared to be equivalent if their distance is 0. The equivalence classes
of Cauchy sequences form a metric space $(X^*, d)$ called the completion of $(X, d)$. The subspace of all
constant sequences is identified with $(X, d)$: formally, it is a dense subspace of $(X^*, d)$ which is isometric
to $(X, d)$. A metric space $(X, d)$ is compact if every collection of open balls covering $(X, d)$ has a finite
subcollection that also covers $(X, d)$. Every compact metric space is complete, but not vice versa.

Let $X$ be a set. A family $\mathcal{T}$ of subsets of $X$ is called a topology if it contains $\emptyset$ and $X$ and is closed
under arbitrary unions and finite intersections. When a specific topology is fixed and clear from the context,
the elements of $\mathcal{T}$ are called open sets, and their complements are called closed sets. Throughout this paper,
these terms will refer to the metric topology of the underlying metric space, the smallest topology that 
contains all open balls (namely, the intersection of all such topologies). A point x is called isolated if the 
singleton set {x} is open. A function between topological spaces is continuous if the inverse image of every 
open set is open. 

Set theory. Let S be a set. A well-ordering on a set S is a total order on S with the property that every 
non-empty subset of S has a least element in this order. Each set can be well-ordered. (More precisely, this 
statement is equivalent to the Axiom of Choice.) 

In Section 9.2 we use ordinals. Ordinals, a.k.a. ordinal numbers, are a classical concept in set theory that, in some
sense, extends natural numbers beyond infinity. Understanding this paper requires only the basic notions
about ordinals, namely the standard (von Neumann) definition of ordinals, successor and limit ordinals, and 
transfinite induction. The necessary material can be found in any introductory text on Mathematical Logic 
and Set Theory, and also on Wikipedia. 

3 Lower bounds via a perfect subspace 

In this section we prove the following lower bound: 

Theorem 3.1. Consider the Lipschitz experts problem on a metric space $(X, d)$ which has a perfect
subspace. Then the problem is not $g$-tractable for any $g \in o(\sqrt{t})$. In fact, a much stronger result holds: there
exists a distribution $\mathcal{P}$ over problem instances $\mu$ such that for any experts algorithm $A$ we have

$(\forall g \in o(\sqrt{t})) \quad \Pr_{\mu \in \mathcal{P}}\left[ R_{(A,\mu)}(t) = O_{\mu}(g(t)) \right] = 0.$ (1)

Let us construct the desired distribution over problem instances. First, we use the existence of a perfect 
subspace to construct a useful system of balls. 

Definition 3.2. A ball-tree on a metric space $(X, d)$ is a complete infinite binary tree whose nodes are pairs
$(x, r)$, where $x \in X$ is the "center" and $r \in (0, 1]$ is the "radius", such that:

• if $(x, r)$ is a parent of $(x', r')$ then $d(x, x') + r' \le r/2$,

• if $(x, r_x)$ and $(y, r_y)$ are siblings, then $r_x + r_y \le d(x, y)$.

In a ball-tree, each tree node $(x, r)$ corresponds to a ball $B(x, r)$ so that each child is a subset of its
parent and any two siblings are disjoint.⁶

Lemma 3.3. For any metric space with a perfect subspace there exists a ball-tree. 

6 Defining internal nodes as balls rather than $(x, r)$ pairs could lead to confusion later in the construction because a ball in a
metric space is a set of points, and as such does not necessarily have a unique center or radius.
Proof. Consider a metric space $(X, d)$ with a perfect subspace $(Y, d)$. Let us construct the ball-tree
recursively, maintaining the invariant that for each tree node $(y, r)$ we have $y \in Y$. Pick an arbitrary $y \in Y$ and
let the root be $(y, 1)$. Suppose we have constructed a tree node $(y, r)$, $y \in Y$. Since $Y$ is perfect, the ball
$B(y, r/4)$ contains another point $y' \in Y$. Let $r' = d(y, y')/2$ and define the two children of $(y, r)$ as $(y, r')$
and $(y', r')$. □
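
The recursive construction in this proof can be tried out on the perfect space $Y = X = [0, 1]$ with $d(x, y) = |x - y|$. The sketch below is only an illustration under that assumption: the function names, the choice of $y'$ at distance $r/8$ from $y$, and the truncation to a finite depth are not from the text. It uses exact rational arithmetic and checks the parent-child and sibling conditions of the ball-tree definition in their non-strict form.

```python
from fractions import Fraction


def build_ball_tree(depth):
    """Build the first `depth` levels of a ball-tree on [0, 1].

    levels[i] lists the (center, radius) pairs at depth i; the children of
    levels[i][j] are levels[i + 1][2 * j] and levels[i + 1][2 * j + 1].
    """
    levels = [[(Fraction(1, 2), Fraction(1))]]   # root (y, 1)
    for _ in range(depth):
        nxt = []
        for y, r in levels[-1]:
            step = r / 8                 # any distance in (0, r/4) would do
            y2 = y + step if y + step <= 1 else y - step
            r2 = abs(y - y2) / 2         # r' = d(y, y') / 2
            nxt.extend([(y, r2), (y2, r2)])
        levels.append(nxt)
    return levels


def check_ball_tree(levels):
    """Verify d(x, x') + r' <= r/2 for children and r_x + r_y <= d(x, y) for siblings."""
    for parents, children in zip(levels, levels[1:]):
        for j, (x, r) in enumerate(parents):
            (a, ra), (b, rb) = children[2 * j], children[2 * j + 1]
            for c, rc in ((a, ra), (b, rb)):
                if abs(x - c) + rc > r / 2:
                    return False
            if ra + rb > abs(a - b):
                return False
    return True


if __name__ == "__main__":
    tree = build_ball_tree(5)
    print([len(level) for level in tree], check_ball_tree(tree))
```

With this choice the sibling condition holds with equality, which is why the non-strict form of the inequality is the one checked here.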



Now let us use the ball-tree to construct the distribution on payoff functions. Consider a metric space
$(X, d)$ with a fixed ball-tree $T$. For each $i \ge 1$, let $D_i$ be the set of all depth-$i$ tree nodes, and let $r_i^* =
\min\{r : (x, r) \in D_i\}$ be the smallest radius among these nodes. Note that $r_i^* \le 2^{-i}$. Choose a number
$n_i$ large enough that $g(n) < \frac{1}{8i}\, r_i^* \sqrt{n}$ for all $n > n_i$, and let $\delta_i = n_i^{-1/2}$. For each tree node $w = (x_0, r_0)$
define a function $F_w : X \to [0, 1]$ by



It is easy to see that $F_w$ is a Lipschitz function on $(X, d)$. A leaf in a ball-tree is an infinite path from the
root: $\mathbf{w} = (w_0, w_1, w_2, \ldots)$, where $w_i \in D_i$ for all $i$. A lineage in a ball-tree is a set of tree nodes containing
at most one child of each node; if it contains exactly one child of each node then we call it a complete lineage.
For each complete lineage $\lambda$ there is an associated leaf $\mathbf{w}(\lambda)$ defined by $\mathbf{w} = (w_0, w_1, \ldots)$ where $w_0$ is the
root and for $i > 0$, $w_i$ is the unique child of $w_{i-1}$ that belongs to $\lambda$. Let us use a lineage $\lambda$ in the ball-tree
to define a probability measure $P_\lambda$ on payoff functions via the following sampling rule. First every node
$w$ independently samples a random sign $\mathrm{sign}(w) \in \{+1, -1\}$, assigning probability $(1 + \delta_i)/2$ to $+1$ if
$w \in \lambda \cap D_i$, and choosing the sign uniformly at random otherwise. Now define a payoff function $\pi$ associated
with this sign pattern as follows: $\pi = \frac{1}{2} + \sum_{w \in T \setminus D_0} \mathrm{sign}(w)\, F_w$. By construction, $\pi$ is a Lipschitz function
taking values in $[0, 1]$. Let $\mathcal{P}_T$ be the distribution over problem instances $P_\lambda$ in which $\lambda$ is a complete lineage
sampled uniformly at random; that is, each node samples one of its children independently and uniformly at
random, and $\lambda$ is the set of sampled children. This completes our construction.
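
The sampling rule just described, a uniformly random complete lineage followed by independent signs that are biased only on lineage nodes, can be sketched on a finite truncation of the tree. In the sketch below, tree nodes are encoded as tuples of bits (the empty tuple is the root), `deltas[i]` plays the role of the bias for depth $i$, and the function name and the truncation to a finite depth are illustrative assumptions; the payoff function itself is not computed.

```python
import random


def sample_lineage_and_signs(depth, deltas, rng):
    """Sample a complete lineage (truncated at `depth`) and a sign pattern.

    Nodes are bit tuples; a node at depth i has length i. Each node picks
    one child uniformly at random; the picked children form the lineage.
    A depth-i node in the lineage gets sign +1 with probability
    (1 + deltas[i]) / 2; every other node gets a uniformly random sign.
    deltas[0] is unused because the root carries no sign.
    """
    lineage, signs = set(), {}
    level = [()]
    for i in range(1, depth + 1):
        children = []
        for node in level:
            lineage.add(node + (rng.randrange(2),))   # sampled child
            children.extend([node + (0,), node + (1,)])
        for w in children:
            p_plus = (1 + deltas[i]) / 2 if w in lineage else 0.5
            signs[w] = 1 if rng.random() < p_plus else -1
        level = children
    return lineage, signs


if __name__ == "__main__":
    rng = random.Random(0)
    lineage, signs = sample_lineage_and_signs(3, [0.0, 0.3, 0.2, 0.1], rng)
    print(sorted(lineage))
```

Because the bias on a lineage node at depth $i$ is only $\delta_i$, telling the sampled child apart from its sibling requires many observations, which is the source of hardness exploited by the lower bound.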

Remark. Let $\lambda$ be a complete lineage in the ball-tree, and let $\mathbf{w}(\lambda) = (w_0, w_1, w_2, \ldots)$ be its associated leaf,
where $w_i = (x_i, r_i)$ for all $i$. Suppose for some $i$ we have $x \in B(x_i, r_i/2)$ and $x \notin B(x_{i+1}, r_{i+1})$. Then
the expected payoff function $\mu_\lambda = \mathbb{E}[\pi]$ associated to the measure $P_\lambda$ satisfies $\mu_\lambda(x) = \frac{1}{2} + \sum_{j=1}^{i} r_j \delta_j / 2$.

Lemma 3.4. Consider a metric space $(X, d)$ with a ball-tree $T$. Then (1) holds with $\mathcal{P} = \mathcal{P}_T$.

To prove this lemma, we define a notion called an $(\epsilon, \delta, k)$-ensemble, which is a collection of $k$ payoff
distributions that are nearly indistinguishable from the standpoint of an online learning algorithm. To this
end, we consider a more general setting than the one in the Lipschitz experts problem. In the feasible experts
problem, one is given a set $X$ (not necessarily a metric space) along with a collection $\mathcal{D}$ of Borel probability
measures on the set $[0, 1]^X$ of functions $\pi : X \to [0, 1]$. A problem instance of the feasible experts problem
consists of a triple $(X, \mathcal{D}, P)$ where $X$ and $\mathcal{D}$ are known to the algorithm, and $P \in \mathcal{D}$ is not.

Definition 3.5. Consider a set $X$ and a $(k + 1)$-tuple $\vec{P} = (P_0, P_1, \ldots, P_k)$ of Borel probability measures
on $[0, 1]^X$, the set of $[0, 1]$-valued payoff functions $\pi$ on $X$. For $0 \le i \le k$ and $x \in X$, let $\mu_i(x)$ denote the
expectation of $\pi(x)$ under measure $P_i$. We say that $\vec{P}$ is an $(\epsilon, \delta, k)$-ensemble if there exist pairwise disjoint
subsets $S_1, S_2, \ldots, S_k \subseteq X$ for which the following properties hold:

1. for every $i$ and every event $\mathcal{E}$ in the Borel $\sigma$-algebra of $[0, 1]^X$, we have $1 - \delta < P_0(\mathcal{E})/P_i(\mathcal{E}) < 1 + \delta$,

2. for every $i > 0$, we have $\sup(\mu_i, S_i) - \sup(\mu_i, X \setminus S_i) \ge \epsilon$.



Theorem 3.6. Consider the feasible experts problem on $(X, \mathcal{D})$. Let $\vec{P}$ be an $(\epsilon, \delta, k)$-ensemble with
$\{P_1, \ldots, P_k\} \subseteq \mathcal{D}$ and $0 < \epsilon, \delta < 1/2$. Then for any $t < \ln(17k)/(2\delta^2)$ and any experts algorithm
$A$, at least half of the measures $P_i$ have the property that $R_{(A,P_i)}(t) \ge \epsilon t/2$.

Remarks. For space reasons, the proof of this theorem has been moved to the appendix. The proof of
Theorem 3.1 uses Theorem 3.6 for $k = 2$, and the proof of Theorem 1.6 will use it again for large $k$.

Proof of Lemma 3.4. Consider the ball-tree $T$. For each $i \ge 1$, recall that $D_i$ is the set of all depth-$i$ nodes
in $T$, and that $r_i^* = \min\{r : (x, r) \in D_i\}$ is the smallest radius among these nodes. Let $\mathcal{D}$ be the set
of all probability measures induced by the lineages of $T$. For each complete lineage $\lambda$ and tree node $w$ in
$T$, let $w_1, w_2$ denote the children of $w$, let $i$ denote their depth, and let $w'$ denote the unique element of
$\{w_1, w_2\} \cap \lambda$. The three lineages $\lambda_0 = \lambda \setminus \{w'\}$, $\lambda_1 = \lambda_0 \cup \{w_1\}$, $\lambda_2 = \lambda_0 \cup \{w_2\}$ define a triple of
probability measures $\vec{P} = (P_{\lambda_0}, P_{\lambda_1}, P_{\lambda_2})$ that constitute an $(\epsilon, \delta_i, 2)$-ensemble where $\epsilon = r_i^* \delta_i / 4$.

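Let us fix an experts algorithm $A$. By Theorem 3.6 there exists $\alpha(w) \in \{P_{\lambda_1}, P_{\lambda_2}\}$ such that for any $t_i$
satisfying $1/\delta_i^2 < t_i < \ln(34)/(2\delta_i^2)$,

$R_{(A,\alpha(w))}(t_i) \ge \epsilon t_i / 2 = r_i^* \delta_i t_i / 8 \ge \frac{1}{8}\, r_i^* \sqrt{t_i}.$

Recalling the definition of $n_i = \delta_i^{-2}$, we see that $i \cdot g(t_i) < \frac{1}{8}\, r_i^* \sqrt{t_i} \le R_{(A,\alpha(w))}(t_i)$.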
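
The arithmetic behind this step is short enough to spell out. The block below is a reader's check rather than part of the original argument; it uses only $\epsilon = r_i^* \delta_i / 4$, the time window from Theorem 3.6 with $k = 2$, and the assumption $t_i \ge \delta_i^{-2}$.

```latex
% Reader's check of the chain of inequalities (not from the original text).
\begin{align*}
\text{Window is non-empty:}\quad
  & \frac{\ln 34}{2} \approx 1.763 > 1
    \;\Longrightarrow\; \frac{1}{\delta_i^{2}} < \frac{\ln(34)}{2\,\delta_i^{2}}, \\
\text{Regret in terms of } t_i:\quad
  & \frac{\epsilon\, t_i}{2} = \frac{r_i^{*}\,\delta_i\, t_i}{8}, \\
\text{Using } t_i \ge \delta_i^{-2}:\quad
  & \delta_i\, t_i = \bigl(\delta_i \sqrt{t_i}\bigr)\sqrt{t_i} \ge \sqrt{t_i}
    \;\Longrightarrow\; \frac{r_i^{*}\,\delta_i\, t_i}{8} \ge \frac{r_i^{*}\sqrt{t_i}}{8}.
\end{align*}
```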

For each $i$, let us define $\mathcal{E}_i$ to be the set of input distributions $P_\lambda$ such that $\lambda$ is a complete lineage whose
associated leaf $w(\lambda) = (w_0, w_1, \ldots)$ satisfies $w_i = a(w_{i-1})$. Interpreting these sets as random events under
the probability distribution $\mathcal{P}_T$, we have proved the following: there exists a sequence of events $\mathcal{E}_i$, $i \in \mathbb{N}$,
and a sequence of times $t_i \to \infty$ such that for each $i$ we have (i) $\Pr[\mathcal{E}_i \mid \sigma(\mathcal{E}_1, \ldots, \mathcal{E}_{i-1})] = \tfrac{1}{2}$ and (ii)
$R_{(\mathcal{A}, P)}(t_i) > i \cdot g(t_i)$ for any $P \in \mathcal{E}_i$.

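Now, let us fix an experts algorithm $\mathcal{A}$. For each complete lineage $\lambda$, define $C_\lambda := \inf\{C \le \infty :
R_{(\mathcal{A}, P_\lambda)}(t) \le C\, g(t) \text{ for all } t\}$. Note that $R_{(\mathcal{A}, P_\lambda)}(t) = O(g(t))$ if and only if $C_\lambda < \infty$. We claim that
$\Pr[C_\lambda < \infty] = 0$, where the probability is over the random choice of complete lineage $\lambda$. Indeed, if infinitely
many events $\mathcal{E}_i$ happen, then event $\{C_\lambda < \infty\}$ does not. But the probability that infinitely many events $\mathcal{E}_i$
happen is 1, because for every positive integer $n$,

$$\Pr\Big[\bigcap_{i=n}^{\infty} \overline{\mathcal{E}_i}\Big] = \prod_{i=n}^{\infty} \Pr\Big[\overline{\mathcal{E}_i} \;\Big|\; \bigcap_{j=n}^{i-1} \overline{\mathcal{E}_j}\Big] = 0. \qquad \Box$$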



4 Tractability for compact well-orderable metric spaces 

In this section we prove the main algorithmic result. 

Theorem 4.1. Consider a compact well-orderable metric space (X, d). Then: 

(a) the Lipschitz MAB problem on $(X, d)$ is $f$-tractable for every $f \in \omega(\log t)$;

(b) the Lipschitz experts problem on $(X, d)$ is 1-tractable, even with a double feedback.

We present a joint exposition for both the bandit and the experts version. Let us consider the Lipschitz
MAB/experts problem on a compact metric space $(X, d)$ with a topological well-ordering $\prec$ and a payoff
function $\mu$. For each strategy $x \in X$, let $S(x) = \{y \preceq x : y \in X\}$ be the corresponding initial segment of
the well-ordering $(X, \prec)$. Let $\mu^* = \sup(\mu, X)$ denote the maximal payoff. Call a strategy $x \in X$ optimal
if $\mu(x) = \mu^*$. We rely on the following structural lemma:

Lemma 4.2. There exists an optimal strategy $x^* \in X$ such that $\sup(\mu, X \setminus S(x^*)) < \mu^*$.

Proof. Let $X^*$ be the set of all optimal strategies. Since $\mu$ is a continuous real-valued function on a compact
space $X$, it attains its maximum, i.e. $X^*$ is non-empty, and furthermore $X^*$ is closed. Note that $\{S(x) : x \in
X^*\}$ is an open cover for $X^*$. Since $X^*$ is compact (as a closed subset of a compact set) this cover contains
a finite subcover, call it $\{S(x) : x \in Y^*\}$. Then the $\prec$-maximal element of $Y^*$ is the $\prec$-maximal element of
$X^*$; call it $x^*$. The initial segment $S(x^*)$ is open, so its complement $Y = X \setminus S(x^*)$ is closed and therefore compact.
It follows that $\mu$ attains its maximum on $Y$, say at a point $y^* \in Y$. By the choice of $y^*$ we have $x^* \prec y^*$, so
by the choice of $x^*$ we have $\mu(x^*) > \mu(y^*)$. □

In the rest of this section we let $x^*$ be the strategy from Lemma 4.2. Our algorithm is geared towards
finding $x^*$ eventually, and playing it from then on. The idea is that if we cover $X$ with balls of a sufficiently
small radius, any strategy in a ball containing $x^*$ has a significantly larger payoff than any strategy in a ball
that overlaps with $X \setminus S(x^*)$.

The algorithm accesses the metric space and the well-ordering via the following two oracles. 

Definition 4.3. A $\delta$-covering of a metric space $(X, d)$ is a subset $S \subseteq X$ such that each point in $X$ lies
within distance $\delta$ from some point in $S$. An oracle $O = O(k)$ is a covering oracle for $(X, d)$ if it inputs
$k \in \mathbb{N}$ and outputs a pair $(\delta, S)$ where $\delta = \delta_O(k)$ is a positive number and $S$ is a $\delta$-covering of $X$ consisting
of at most $k$ points. Here $\delta_O(\cdot)$ is any function such that $\delta_O(k) \to 0$ as $k \to \infty$.

Definition 4.4. Given a metric space $(X, d)$ and a total order $(X, \prec)$, the ordering oracle inputs a finite
collection of balls (given by the centers and the radii), and returns the $\prec$-maximal element covered by the
closure of these balls, if such element exists, and an arbitrary point in $X$ otherwise.

Our algorithm is based on the following exploration subroutine EXPL().

Algorithm 4.5. Subroutine EXPL(k, n, r): inputs $k, n \in \mathbb{N}$ and $r \in (0, 1)$, outputs a point in $X$.

First it calls the covering oracle $O(k)$ and receives a $\delta$-covering $S$ of $X$ consisting of at most $k$ points.
Then it plays each strategy $x \in S$ exactly $n$ times; let $\mu_{av}(x)$ be the sample average. Let us say that $x$ is a
loser if $\mu_{av}(y) - \mu_{av}(x) > 2r + \delta$ for some $y \in S$. Finally, it calls the ordering oracle with the collection of
all closed balls $B(x, \delta)$ such that $x$ is not a loser, and outputs the point $x_{or} \in X$ returned by this oracle call.
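
The steps of EXPL can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the metric space is a toy finite subset of the real line, `covering_oracle` and `ordering_oracle` are hypothetical stand-ins for the two oracles of Definitions 4.3 and 4.4, the well-ordering is taken to be "smaller value is later in the order" (so 0 is the maximal element), and `play` is a caller-supplied function returning one payoff sample.

```python
from typing import Callable, List, Tuple

# Toy metric space on the real line (an assumption made for illustration only).
X: List[float] = [1.0, 0.5, 0.25, 0.125, 0.0]


def covering_oracle(k: int) -> Tuple[float, List[float]]:
    """Hypothetical covering oracle: returns (delta, S) with |S| <= k."""
    delta = 1.0 / k
    cover: List[float] = []
    for x in X:
        # Greedy: keep x only if no chosen point is already within delta of it.
        if all(abs(x - c) > delta for c in cover):
            cover.append(x)
    return delta, cover


def ordering_oracle(balls: List[Tuple[float, float]]) -> float:
    """Hypothetical ordering oracle for the order 1 < 1/2 < ... < 0 (0 is maximal)."""
    covered = [x for x in X if any(abs(x - c) <= rad for c, rad in balls)]
    # The order-maximal covered point is the numerically smallest one.
    return min(covered) if covered else X[0]


def expl(k: int, n: int, r: float, play: Callable[[float], float]) -> float:
    """Sketch of EXPL(k, n, r): explore a covering, drop losers, ask the ordering oracle."""
    delta, cover = covering_oracle(k)
    avg = {x: sum(play(x) for _ in range(n)) / n for x in cover}
    best = max(avg.values())
    # x is a loser if some y beats it by more than 2r + delta.
    non_losers = [x for x in cover if best - avg[x] <= 2 * r + delta]
    return ordering_oracle([(x, delta) for x in non_losers])


# Noise-free payoffs with a unique optimum at x = 0.5.
result = expl(k=8, n=3, r=0.05, play=lambda x: 1.0 - abs(x - 0.5))
```

With noise-free payoffs every point of the covering other than 0.5 is a loser, so the ordering oracle is queried with a single ball around the optimum.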

Clearly, EXPL(k, n, r) takes at most $kn$ rounds to complete. We show that for sufficiently large $k$, $n$ and
sufficiently small $r$ it returns $x^*$ with high probability.

Lemma 4.6. Fix a problem instance and let $x^*$ be the optimal strategy from Lemma 4.2. Consider increasing
functions $k, n, T : \mathbb{N} \to \mathbb{N}$ such that $r(t) := 4\sqrt{(\log T(t))/n(t)} \to 0$. Then for any sufficiently large $t$,
with probability at least $1 - T^{-2}(t)$, the subroutine EXPL(k(t), n(t), r(t)) returns $x^*$.

Proof. Let us use the notation from Algorithm 4.5. Fix $t$ and consider a run of EXPL(k(t), n(t), r(t)). Call
this run clean if for each $x \in S$ we have $|\mu_{av}(x) - \mu(x)| < r(t)$. By Chernoff Bounds, this happens with
probability at least $1 - T^{-2}(t)$. In the rest of the proof, let us assume that the run is clean.

Let $B$ be the union of the closed balls $B(x, \delta)$, $x \in S^*$, where $S^* \subseteq S$ is the set of points in $S$ that are not
losers. Then the ordering oracle returns the $\prec$-maximal point in $B$ if such point exists. We will show that
$x^* \in B \subseteq S(x^*)$ for any sufficiently large $t$, which will imply the lemma.

We claim that $x^* \in B$. Since $S$ is a $\delta$-covering, there exists $y^* \in S$ such that $d(x^*, y^*) \le \delta$. Let us fix
one such $y^*$. It suffices to prove that $y^*$ is not a loser. Indeed, if $\mu_{av}(y) - \mu_{av}(y^*) > 2r(t) + \delta$ for some
$y \in S$ then $\mu(y) > \mu(y^*) + \delta \ge \mu^*$, contradiction. Claim proved.

Let $\mu_0 = \sup(\mu, X \setminus S(x^*))$ and let $r_0 = (\mu^* - \mu_0)/7$. Let us assume that $t$ is sufficiently large so that
$r(t) < r_0$ and $\delta = \delta_O(k(t)) < r_0$, where $\delta_O(\cdot)$ is from the definition of the covering oracle.

We claim that $B \subseteq S(x^*)$. Indeed, consider $x \in S$ and $y \in X \setminus S(x^*)$ such that $d(x, y) \le \delta$. It suffices
to prove that $x$ is a loser. Consider some $y^* \in S$ such that $d(x^*, y^*) \le \delta$. Then by the Lipschitz condition

$$\mu_{av}(y^*) > \mu(y^*) - r_0 > \mu^* - 2r_0,$$

$$\mu_{av}(x) < \mu(x) + r_0 < \mu(y) + 2r_0 \le \mu_0 + 2r_0 = \mu^* - 5r_0,$$
$$\mu_{av}(y^*) - \mu_{av}(x) > 3r_0 > 2r(t) + \delta. \qquad \Box$$
Proof of Theorem 4.1. Let us fix a function $f \in \omega(\log t)$. Then $f(t) = \alpha(t) \log(t)$ where $\alpha(t) \to \infty$.
Without loss of generality, assume that $\alpha(t)$ is non-decreasing. (If not, then instead of $f(t)$ use $g(t) =
\beta(t) \log(t)$, where $\beta(t) = \inf\{\alpha(t') : t' \ge t\}$.)

For part (a), define $k_t = \lfloor \sqrt{g(t)/\log t} \rfloor$, $n_t = \lfloor k_t \log t \rfloor$, and $r_t = 4\sqrt{(\log t)/n_t}$. Note that $r_t \to 0$.

The algorithm proceeds in phases of a doubly exponential length (see footnote 7). A given phase $i = 1, 2, 3, \ldots$ lasts
for $T = 2^{2^i}$ rounds. In this phase, first we call the exploration subroutine EXPL($k_T$, $n_T$, $r_T$). Let $x_{or} \in X$
be the point returned by this subroutine. Then we play $x_{or}$ till the end of the phase. This completes the
description of the algorithm.
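
To make the schedule concrete, the following sketch tabulates the doubly exponential phase lengths and the exploration parameters derived from them. The choice $g(t) = (\log t)^2$ is an assumption made purely for illustration (any function in $\omega(\log t)$ would do), and the helper names are hypothetical.

```python
import math
from typing import Callable, Tuple


def phase_length(i: int) -> int:
    """Length of phase i: T = 2^(2^i) rounds."""
    return 2 ** (2 ** i)


def expl_parameters(T: int, g: Callable[[float], float]) -> Tuple[int, int, float]:
    """Parameters (k_T, n_T, r_T) passed to EXPL in a phase of length T."""
    log_T = math.log(T)
    k = math.floor(math.sqrt(g(T) / log_T))
    n = math.floor(k * log_T)
    r = 4 * math.sqrt(log_T / n)
    return k, n, r


def phases_needed(t: int) -> int:
    """Smallest number of phases whose total length is at least t."""
    total, i = 0, 0
    while total < t:
        i += 1
        total += phase_length(i)
    return i


def g(t: float) -> float:
    """Illustrative regret target in omega(log t); an assumption, not from the text."""
    return math.log(t) ** 2


schedule = [(phase_length(i),) + expl_parameters(phase_length(i), g) for i in range(1, 6)]
```

In each phase the exploration cost $k_T n_T$ stays below $g(T)$, and only about $\log \log t$ phases are needed to cover $t$ rounds, which is why the per-phase costs can be summed without an extra logarithmic factor.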

Fix a problem instance $\mathcal{I}$. Let $W_i$ be the total reward accumulated by the algorithm in phase $i$, and let
$R_i = 2^{2^i} \mu^* - W_i$ be the corresponding share of regret. By Lemma 4.6 there exists $i_0 = i_0(\mathcal{I})$ such that for
any phase $i \ge i_0$ we have, letting $T = 2^{2^i}$ be the phase duration, that $R_i \le k_T n_T \le g(T)$ with probability
at least $1 - T^{-2}$, and therefore $E[R_i] \le g(T) + T^{-1}$. For any $t > t_0 = 2^{2^{i_0}}$ it follows by summing over
$i \in \{i_0, i_0 + 1, \ldots, \lceil \log \log t \rceil\}$ that $R_{(\mathcal{A}, \mathcal{I})}(t) = O(t_0 + g(t))$. Note that we have used the fact that $\alpha(t)$ is
non-decreasing.

For part (b), we separate exploration and exploitation. For exploration, we run EXPL() on the free
peeks. For exploitation, we use the point returned by EXPL() in the previous phase. Specifically, define
$k_t = n_t = \lfloor \sqrt{t} \rfloor$, and $r_t = 4\sqrt{t^{1/4}/n_t}$. The algorithm proceeds in phases of exponential length. A
given phase $i = 1, 2, 3, \ldots$ lasts for $T = 2^i$ rounds. In this phase, we run the exploration subroutine
EXPL($k_T$, $n_T$, $r_T$) on the free peeks. In each round, we bet on the point returned by EXPL() in the previous
phase. This completes the description of the algorithm.

By Lemma 4.6 there exists $i_0 = i_0(\mathcal{I})$ such that in any phase $i > i_0$ the algorithm incurs zero regret
with probability at least 1 — e n W. Thus the total regret after t > 2 l ° rounds is at most to + 0(1). □ 



5 The (log t)-intractability for infinite metric spaces: proof of Theorem 1.3



Consider an infinite metric space $(X, d)$. In view of Theorem 1.5 we can assume that the completion $X^*$ of
$X$ is compact. It follows that there exists $x^* \in X^*$ such that $x_i \to x^*$ for some sequence $x_1, x_2, \ldots \in X$.
Let $r_i = d(x_i, x^*)$. Without loss of generality, assume that $r_{i+1} < \tfrac{1}{2} r_i$ for each $i$, and that the diameter of
$X$ is 1.

Let us define an ensemble of payoff functions $\mu_i : X \to [0, 1]$, $i \in \mathbb{N}$, where $\mu_0$ is the "baseline"
function, and for each $i \ge 1$ function $\mu_i$ is the "counterexample" in which a neighborhood of $x_i$ has slightly
higher payoffs. The "baseline" is defined by fio(x) = \ — d ^ x % \ and the "counterexamples" are given by 

Hi(x) = no(x) + Vi{x), where Vi(x) = | max (0, y — d(x,x*)) . 

Note that both an d v% are |-Lipschitz and |-Lipschitz w.r.t. (X,d), respectively, so /ij is |-Lipschitz 
w.r.t. $(X, d)$. Let us fix a MAB algorithm $\mathcal{A}$ and assume that it is $(\log t)$-tractable. Then for each $i \ge 0$ there
exists a constant $C_i$ such that $R_{(\mathcal{A}, \mu_i)}(t) < C_i \log t$ for all times $t$. We will show that this is not possible.

Intuitively, the ability of an algorithm to distinguish between payoff functions $\mu_0$ and $\mu_i$, $i \ge 1$, depends
on the number of samples in the ball $B_i = B(x_i, r_i/3)$. (This is because $\mu_0 = \mu_i$ outside $B_i$.) In particular,
the number of samples itself cannot be too different under $\mu_0$ and under $\mu_i$, unless it is large. To formalize
this idea, let $N_i(t)$ be the number of times algorithm $\mathcal{A}$ selects a strategy in the ball $B_i$ during the first $t$
rounds, and let $\sigma(N_i(t))$ be the corresponding $\sigma$-algebra. Let $P_i[\cdot]$ and $E_i[\cdot]$ be, respectively, the distribution
and expectation induced by $\mu_i$. Then we can connect $E_0[N_i(t)]$ with the probability of any event $S \in
\sigma(N_i(t))$ as follows.



7 The doubly exponential phase length is necessary in order to get $f$-tractability. If we employed the more familiar doubling
trick of using phase length $2^i$ (as in [5, 27, 30] for example) then the algorithm would only be $f(t) \log t$-tractable.
Claim 5.1. For any $i \ge 1$ and any event $S \in \sigma(N_i(t))$ it is the case that

HS] < | <n[S] -ln(P<[5])-|<0(rf)Eo[JVi(t)]. (3) 

Claim 5.1 is proved using KL-divergence techniques, see Appendix B for details. To complete the proof of
the theorem, we claim that for each $i \ge 1$ it is the case that $E_0[N_i(t)] \ge \Omega(r_i^{-2} \log t)$ for any sufficiently
large $t$. Indeed, fix $i$ and let $S = \{N_i(t) < r_i^{-2} \log t\}$. Since

d log t > R {A w) (t) > P f (5) (t - rr 2 log i)f , 

it follows that P»(5) < r 1/2 < | for any sufficiently large t. Then by Claim O either P (5) < ± or the 
consequent in ([3]) holds. In both cases E [Ni(t)] > tt(r~ 2 logi). Claim proved. 

Finally, the fact that p, (x*) — ^o(x) > n/Vl for every x G Sj implies that Mo ) (i) > j|Eo[iV"i(t)] > 
^(r^ 1 logt) which establishes Theorem [T31 since r^ 1 — > oo as i — > oo. 

6 Tractability via more intuitive oracle access 

In Theorem 4.1 the algorithm accesses the metric space via two oracles: a very intuitive covering oracle,
and a less intuitive ordering oracle. In this section we show that for a wide family of metric spaces
(including, for example, compact metric spaces with a finite number of limit points) the ordering oracle
is not needed: we provide an algorithm which accesses the metric space via a finite set of covering oracles.
We will consider metric spaces of finite Cantor-Bendixson rank, a classic notion from point-set topology.

Definition 6.1. Fix a metric space $(X, d)$. If for some $x \in X$ there exists a sequence of points in $X \setminus \{x\}$
which converges to $x$, then $x$ is called a limit point. For $S \subseteq X$ let LIM($S$) denote the limit set: the set of
all limit points of $S$. Let LIM($S$, 0) = $S$, and LIM($S$, $i$) = LIM(LIM($\cdots$ LIM($S$))), where LIM($\cdot$) is applied $i$
times. The Cantor-Bendixson rank of $(X, d)$ is defined as $\sup\{n : \mathrm{LIM}(X, n) \ne \emptyset\}$.

Let us say that a Cantor-Bendixson metric space is one with a finite Cantor-Bendixson rank. In order to
apply Theorem 4.1 we show that any such metric space is well-orderable.

Lemma 6.2. Any Cantor-Bendixson metric space is well-orderable. 

Proof. Any finite metric space is trivially well-orderable. To prove the lemma, it suffices to show the
following: any metric space $(X, d)$ is well-orderable if so is (LIM($X$), $d$).

Let $X_1 = X \setminus \mathrm{LIM}(X)$ and $X_2 = \mathrm{LIM}(X)$. Suppose $(X_2, d)$ admits a topological well-ordering $\prec_2$.
Define a binary relation $\prec$ on $X$ as follows. Fix an arbitrary well-ordering $\prec_1$ on $X_1$. For any $x, y \in X$
posit $x \prec y$ if either (i) $x, y \in X_1$ and $x \prec_1 y$, or (ii) $x, y \in X_2$ and $x \prec_2 y$, or (iii) $x \in X_1$ and $y \in X_2$. It
is easy to see that $(X, \prec)$ is a well-ordering.

It remains to prove that an arbitrary initial segment $Y = \{x \in X : x \prec y\}$ is open in $(X, d)$. We
need to show that for each $x \in Y$ there is a ball $B(x, \epsilon)$, $\epsilon > 0$, which is contained in $Y$. This is true if
$x \in X_1$ since by definition each such $x$ is an isolated point in $X$. If $x \in X_2$ then $Y = X_1 \cup Y_2$ where
$Y_2 = \{x \in X_2 : x \prec_2 y\}$ is the initial segment of $X_2$. Since $Y_2$ is open in $(X_2, d)$, there exists $\epsilon > 0$ such
that $B_{X_2}(x, \epsilon) \subseteq Y_2$. It follows that $B_X(x, \epsilon) \subseteq B_{X_2}(x, \epsilon) \cup X_1 \subseteq Y$. □

The structure of a Cantor-Bendixson metric space is revealed by a partition of $X$ into subsets $X_i =
\mathrm{LIM}(X, i) \setminus \mathrm{LIM}(X, i + 1)$, $0 \le i \le n$. For a point $x \in X_i$, we define the rank to be $i$. The algorithm
requires a covering oracle for each $X_i$.
Theorem 6.3. Consider the Lipschitz MAB/experts problem on a compact metric space $(X, d)$ such that
$\mathrm{LIM}(X, N) = \emptyset$ for some $N$. Let $O_i$ be the covering oracle for $X_i = \mathrm{LIM}(X, i) \setminus \mathrm{LIM}(X, i + 1)$. Assume
that access to the metric space is provided only via the collection of oracles $\{O_i\}_{i=0}^{N}$. Then:

(a) the Lipschitz MAB problem on $(X, d)$ is $f$-tractable for every $f \in \omega(\log t)$;

(b) the Lipschitz experts problem on $(X, d)$ is 1-tractable, even with a double feedback.

In the rest of this section, consider the setting in Theorem 6.3. We describe the exploration subroutine
EXPL'(), which is similar to EXPL() in Section 4 but does not use the ordering oracle. Then we prove a
version of Lemma 4.6 for EXPL'(). Once we have this lemma, the proof of Theorem 6.3 is identical to that
of Theorem 4.1 (and is omitted).

Algorithm 6.4. Subroutine EXPL'(k, n, r): inputs $k, n \in \mathbb{N}$ and $r \in (0, 1)$, outputs a point in $X$.

Call each covering oracle $O_i(k)$ and receive a $\delta_i$-covering $S_i$ of $X_i$ consisting of at most $k$ points. Let
$S = \cup_i S_i$. Play each strategy $x \in S$ exactly $n$ times; let $\mu_{av}(x)$ be the corresponding sample average. For
$x, y \in S$, let us say that $x$ dominates $y$ if $\mu_{av}(x) - \mu_{av}(y) > 2r$. Call $x \in S$ a winner if $x$ has a largest rank
among the strategies that are not dominated by any other strategy. Output an arbitrary winner if a winner
exists, else output an arbitrary point in S.
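
The winner-selection step of EXPL' is easy to state in code. The sketch below assumes the sample averages and the ranks have already been collected into dictionaries; the function name and the example data are hypothetical.

```python
from typing import Dict, Hashable, List


def winners(avg: Dict[Hashable, float], rank: Dict[Hashable, int], r: float) -> List[Hashable]:
    """Return the undominated strategies of largest rank.

    x dominates y when avg[x] - avg[y] > 2r.
    """
    undominated = [
        x for x in avg
        if not any(avg[y] - avg[x] > 2 * r for y in avg)
    ]
    if not undominated:
        return []
    top_rank = max(rank[x] for x in undominated)
    return [x for x in undominated if rank[x] == top_rank]


# Hypothetical sample averages: "a" is isolated (rank 0), "b" and "c" are limit points (rank 1).
avg = {"a": 0.90, "b": 0.88, "c": 0.50}
rank = {"a": 0, "b": 1, "c": 1}
chosen = winners(avg, rank, r=0.05)
```

Here "c" is dominated by "a", while "a" and "b" are too close to separate at resolution $2r$; the tie is broken in favour of the higher-rank point "b", mirroring the preference for limit points in the algorithm.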

Clearly, EXPL'(k, n, r) takes at most $knN$ rounds to complete. We show that for sufficiently large $k$, $n$
and sufficiently small $r$ it returns an optimal strategy with high probability.

Lemma 6.5. Fix a problem instance. Consider increasing functions $k, n, T : \mathbb{N} \to \mathbb{N}$ such that $r(t) :=
4\sqrt{(\log T(t))/n(t)} \to 0$. Then for any sufficiently large $t$, with probability at least $1 - T^{-2}(t)$, the
subroutine EXPL'(k(t), n(t), r(t)) returns an optimal strategy.

Proof. Use the notation from Algorithm 6.4. Fix $t$ and consider a run of EXPL'(k(t), n(t), r(t)). Call this
run clean if for each $x \in S$ we have $|\mu_{av}(x) - \mu(x)| < r(t)$. By Chernoff Bounds, this happens with
probability at least $1 - T^{-2}(t)$. In the rest of the proof, let us assume that the run is clean.

Let us introduce some notation. Let $\mu$ be the payoff function and let $\mu^* = \sup(\mu, X)$. Call $x \in X$
optimal if $\mu(x) = \mu^*$. (There exists an optimal strategy since $(X, d)$ is compact.) Let $i^*$ be the largest rank
of any optimal strategy. Let $X^*$ be the set of all optimal strategies of rank $i^*$. Let $Y = \mathrm{LIM}(X, i^*)$. Since
each point $x \in X_{i^*}$ is an isolated point in $Y$, there exists some $r(x) > 0$ such that $x$ is the only point of
$B(x, r(x))$ that lies in $Y$.

We claim that $\sup(\mu, Y \setminus X^*) < \mu^*$. Indeed, consider $C = \cup_{x \in X^*} B(x, r(x))$. This is an open set.
Since $Y$ is closed, $Y \setminus C$ is closed, too, hence compact. Therefore there exists $y \in Y \setminus C$ such that
$\mu(y) = \sup(\mu, Y \setminus C)$. Since $X^* \subseteq C$, $y$ is not optimal, i.e. $\mu(y) < \mu^*$. Finally, by definition of $r(x)$
we have $Y \setminus C = Y \setminus X^*$. Claim proved.

Pick any $x^* \in X^*$. Let $\mu_0 = \sup(\mu, Y \setminus X^*)$. Assume that $t$ is large enough so that $r(t) < (\mu^* - \mu_0)/4$
and $\delta_{i^*} < r(x^*)$. Note that the $\delta_{i^*}$-covering $S_{i^*}$ contains $x^*$.

Finally, we claim that in a clean phase, $x^*$ is a winner, and all winners lie in $X^*$. Indeed, note that $x^*$
dominates any non-optimal strategy $y \in S$ of larger or equal rank, i.e. any $y \in S \cap (Y \setminus X^*)$. This is
because $\mu_{av}(x^*) - \mu_{av}(y) > \mu^* - \mu_0 - 2r > 2r$. The claim follows since any optimal strategy cannot be
dominated by any other strategy. □



7 Boundary of tractability: Theorem 1.5



In this section we prove Theorem 1.5. In Appendix A we reduce the theorem to that on complete metric
spaces. We will use a basic fact that a complete metric space is compact if and only if for any $r > 0$, it can
be covered by a finite number of balls of radius $r$.
Algorithmic result. We consider a compact metric space $(X, d)$ and use an extension of the naive
algorithm from [27, 30]. In each phase $i$ (which lasts for $t_i$ rounds) we fix a covering of $X$ with $N_i < \infty$ balls of
radius $2^{-i}$ (such covering exists by compactness), and run a fresh instance of the $N_i$-armed bandit algorithm
UCB1 from [4] on the centers of these balls. (This algorithm is for the "basic" MAB problem, in the sense
that it does not look at the distances in the metric space.) The phase durations $t_i$ need to be tuned to the
$N_i$'s. In the setting considered in [27, 30] (bounded covering dimension) it suffices to tune each $t_i$ to the
corresponding $N_i$ in a fairly natural way. The difficulty in the present setting is that there are no guarantees on
how fast the $N_i$'s grow. To take this into account, we fine-tune each $t_i$ to (essentially) all covering numbers
$N_1, \ldots, N_{i+1}$.

Let $R_k(t)$ be the expected regret accumulated by the algorithm in the first $t$ rounds of phase $k$. Using
the off-the-shelf regret guarantees for UCB1, it is easy to see [27, 30] that

R k (t) < 0(V 'N k t log t) + e k t<e k m<xx(f k , t), where t% = 2 ^ log (4) 

Let us specify phase durations $t_i$. They are defined very differently from the ones in [27, 30]. In
particular, in [27, 30] each $t_i$ is fine-tuned to the corresponding covering number $N_i$ by setting $t_i = t^*_i$, and
the analysis works out for metric spaces of bounded covering dimension. In our setting, we fine-tune each $t_i$
to (essentially) all covering numbers $N_1, \ldots, N_{i+1}$. Specifically, we define the $t_i$'s inductively as follows:

$$t_i = \max\Big(t^*_i,\; t^*_{i+1},\; 2 \sum_{j=1}^{i-1} t_j\Big).$$
This completes the description of the algorithm, call it $\mathcal{A}$.
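
The inductive definition can be unrolled mechanically. The sketch below reads the recursion as a maximum of three terms, which is the reading the proof of Lemma 7.1 relies on (it uses $t_k \ge t^*_k$ and $t_k \ge 2 s_{k-1}$). The thresholds $t^*_i$ are taken as an input list, and the numbers in the example are made up for illustration.

```python
from typing import List


def phase_durations(t_star: List[int]) -> List[int]:
    """Compute t_i = max(t*_i, t*_{i+1}, 2 * sum_{j<i} t_j).

    t_star[0] holds t*_1, t_star[1] holds t*_2, and so on; a list of length m + 1
    yields the first m phase durations.
    """
    durations: List[int] = []
    elapsed = 0  # s_{i-1}: total length of all earlier phases
    for i in range(len(t_star) - 1):
        t_i = max(t_star[i], t_star[i + 1], 2 * elapsed)
        durations.append(t_i)
        elapsed += t_i
    return durations


# Hypothetical thresholds; note the jump at the fourth entry.
example_thresholds = [10, 50, 30, 1000, 20]
example_durations = phase_durations(example_thresholds)
```

Each phase is at least twice as long as everything before it, and a phase is also stretched in advance when the next threshold is large, so that the following phase never starts before its own bound becomes meaningful.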

Lemma 7.1. Consider the Lipschitz MAB problem on a compact and complete metric space $(X, d)$. Then
$R_{\mathcal{A}}(t) \le 5\, \epsilon(t)\, t$, where $\epsilon(t) = \min\{2^{-k} : t \le s_k\}$ and $s_k = \sum_{i=1}^{k} t_i$. In particular, $R_{\mathcal{A}}(t) = o(t)$.

Proof. First we claim that $R_{\mathcal{A}}(s_k) \le 2 \epsilon_k s_k$ for each $k$. Use induction on $k$. For the induction base, note
that $R_{\mathcal{A}}(s_1) = R_1(t_1) \le \epsilon_1 t_1$ by (4). Assume the claim holds for some $k - 1$. Then

$$R_{\mathcal{A}}(s_k) = R_{\mathcal{A}}(s_{k-1}) + R_k(t_k)$$

$$\le 2 \epsilon_{k-1} s_{k-1} + \epsilon_k t_k$$

$$\le 2 \epsilon_k (s_{k-1} + t_k) = 2 \epsilon_k s_k,$$

claim proved. Note that we have used (4) and the facts that $t_k \ge t^*_k$ and $t_k \ge 2 s_{k-1}$.
For the general case, let $T = s_{k-1} + t$, where $t \in (0, t_k)$. Then by (4) we have that

$$R_k(t) \le \epsilon_k \max(t^*_k, t)$$

$$\le \epsilon_k \max(t_{k-1}, t) \le \epsilon_k T,$$
$$R_{\mathcal{A}}(T) = R_{\mathcal{A}}(s_{k-1}) + R_k(t)$$

$$\le 2 \epsilon_{k-1} s_{k-1} + \epsilon_k T \le 5 \epsilon_k T. \qquad \Box$$



Lower bound: proof sketch. For the lower bound, we consider a metric space $(X, d)$ with
infinitely many disjoint balls $B(x_i, r^*)$ for some $r^* > 0$. For each ball $i$ we define the wedge function supported on
this ball:

$$G_{(i,r)}(x) = \begin{cases} \min\{r^* - d(x, x_i),\; r^* - r\} & \text{if } x \in B(x_i, r^*), \\ 0 & \text{otherwise.} \end{cases}$$

The balls are partitioned into two infinite sets: the ordinary and special balls. The random payoff function
is then defined by taking a constant function, adding the wedge function on each special ball, and randomly
adding or subtracting the wedge function on each ordinary ball. Thus, the expected payoff is constant
throughout the metric space except that it assumes higher values on the special balls. However, the algorithm
has no chance of ever finding these balls, because at time $t$ they are statistically indistinguishable from the
$2^{-t}$ fraction of ordinary balls that randomly happen to never subtract their wedge function during the first $t$
steps of play.

Lower bound: full proof. Suppose $(X, d)$ is not compact. Fix $r > 0$ such that $X$ cannot be covered by a
finite number of balls of radius $r$. There exists a countably infinite subset $S \subseteq X$ such that the balls $B(x, r)$,
$x \in S$, are mutually disjoint. (Such subset can be constructed inductively.) Number the elements of $S$ as
$s_1, s_2, \ldots$, and denote the ball $B(s_i, r)$ by $B(i)$.

Suppose there exists a Lipschitz experts algorithm $\mathcal{A}$ that is $g(t)$-tractable for some $g \in o(t)$. Pick an
increasing sequence $t_1, t_2, \ldots \in \mathbb{N}$ such that $t_{k+1} > 2 t_k \ge 10$ and $g(t_k) < r_k t_k / k$ for each $k$, where
$r_k = r/2^{k+1}$. Let $m_0 = 0$ and $m_k = \sum_{i=1}^{k} 4^{t_i}$ for $k > 0$, and let $I_k = \{m_k + 1, \ldots, m_{k+1}\}$. The intervals
$I_k$ form a partition of $\mathbb{N}$ into sets of sizes $4^{t_1}, 4^{t_2}, \ldots$. For every $i \in \mathbb{N}$, let $k$ be the unique value such that
$i \in I_k$ and define the following Lipschitz function supported in $B(s_i, r)$:



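If $J \subseteq \mathbb{N}$ is any set of natural numbers, we can define a distribution $P_J$ on payoff functions by sampling
independent, uniformly-random signs $\sigma_i \in \{\pm 1\}$ for every $i \in \mathbb{N}$ and defining the payoff function to be

$$\pi = \tfrac{1}{2} + \sum_{i \notin J} \sigma_i G_i + \sum_{i \in J} G_i,$$

where $G_i$ is the Lipschitz function supported in $B(s_i, r)$ defined above.
Note that the distribution $P_J$ has expected payoff function $\mu = \tfrac{1}{2} + \sum_{i \in J} G_i$. Let us define a distribution $\mathcal{P}$
over problem instances $P_J$ by letting $J$ be a random subset of $\mathbb{N}$ obtained by sampling exactly one element
$j_k$ of each set $I_k$ uniformly at random, independently for each $k$.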

Intuitively, consider an algorithm that is trying to discover the value of $j_k$. Every time a payoff function
$\pi_t$ is revealed, we get to see a random $\{\pm 1\}$ sample at every element of $I_k$ and we can eliminate the
possibility that $j_k$ is one of the elements that sampled $-1$. This filters out about half the elements of $I_k$ in
every time step, but $|I_k| = 4^{t_k}$ so on average it takes $2 t_k$ steps before we can discover the identity of $j_k$.
Until that time, whenever we play a strategy in $\cup_{i \in I_k} B(i)$, there is a constant probability that our regret is at
least $r_k$. Thus our regret is bounded below by $r_k t_k > k\, g(t_k)$. This rules out the possibility of a $g(t)$-tractable
algorithm. The following lemma makes this argument precise.
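
The halving heuristic in this paragraph can be checked with a small simulation. The sketch below is an illustration only: it models the $m - 1$ ordinary indices as independent fair coins, removes an index the first time its coin shows $-1$, and counts rounds until only the special index is left. The function names and the fixed seed are assumptions.

```python
import random


def expected_survivors(m: int, s: int) -> float:
    """Expected number of indices never showing -1 in s rounds (the special one always survives)."""
    return 1 + (m - 1) * 2.0 ** (-s)


def rounds_until_identified(m: int, rng: random.Random) -> int:
    """Simulate the filtering: each ordinary index survives a round with probability 1/2."""
    survivors = m - 1  # ordinary indices not yet eliminated
    rounds = 0
    while survivors > 0:
        survivors = sum(1 for _ in range(survivors) if rng.random() < 0.5)
        rounds += 1
    return rounds


t_k = 5
m = 4 ** t_k  # |I_k| = 4^{t_k}
simulated_rounds = rounds_until_identified(m, random.Random(0))
survivors_at_t_k = expected_survivors(m, t_k)
```

After $t_k$ rounds roughly $2^{t_k}$ candidates remain, so the special index is still hidden; it takes on the order of $\log_2 |I_k| = 2 t_k$ rounds before the candidate set collapses to a single element.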

Lemma 7.2. $\Pr_{P \in \mathcal{P}}\big[R_{(\mathcal{A}, P)}(t) = O(g(t))\big] = 0$.

Proof. Let $j_1, j_2, \ldots$ be the elements of the random set $J$, numbered so that $j_k \in I_k$ for all $k$. For any
$i, t \in \mathbb{N}$, let $\sigma(i, t)$ denote the value of $\sigma_i$ sampled at time $t$ when sampling the sequence of i.i.d. payoff
functions $\pi_t$ from distribution $P_J$. We know that $\sigma(j_k, t) = 1$ for all $t$. In fact if $S(k, t)$ denotes the set
of all $i \in I_k$ such that $\sigma(i, 1) = \sigma(i, 2) = \cdots = \sigma(i, t) = 1$ then conditional on the value of the set
$S(k, t)$, the value of $j_k$ is distributed uniformly at random in $S(k, t)$. As long as this set $S(k, t)$ has at least
$n$ elements, the probability that the algorithm picks a strategy $x_t$ belonging to $B(j_k)$ at time $t$ is bounded
above by $\tfrac{1}{n}$, even if we condition on the event that $x_t \in \cup_{i \in I_k} B(i)$. For any given $i \in I_k \setminus \{j_k\}$, we have
$P_J(i \in S(k, t)) = 2^{-t}$ and these events are independent for different values of $i$. Setting $n = 2^{t_k}$, so that
$|I_k| = n^2$, we have



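$$P_J\big[\,|S(k, t)| < n\,\big] \le \sum_{R \subseteq I_k,\ |R| = n} P_J[S(k, t) \subseteq R]$$

$$\le \binom{n^2}{n} (1 - 2^{-t})^{n^2 - n} < \left(n^2 \cdot (1 - 2^{-t})^{n-1}\right)^n$$

$$< \exp\left(n \left(2 \ln(n) - (n - 1)/2^t\right)\right). \qquad (5)$$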

As long as $t \le t_{k-1}$, the relation $t_k > 2t$ implies $(n - 1)/2^t > \sqrt{n}$ so the expression (5) is bounded
above by $\exp(-n\sqrt{n} + 2n \ln(n))$, which equals $\exp(-8^{t_k} + 2 \ln(4)\, t_k 4^{t_k})$ and is in turn bounded above
by $\exp(-8^{t_k}/2)$.

Let $B(j_{>k})$ denote the union $B(j_{k+1}) \cup B(j_{k+2}) \cup \ldots$, and let $N(t, k)$ denote the random variable
that counts the number of times $\mathcal{A}$ selects a strategy in $B(j_{>k})$ during rounds $1, \ldots, t$. We have already
demonstrated that for all $t \le t_k$,

$$\Pr_{P_J \in \mathcal{P}}\big(x_t \in B(j_{>k})\big) \le 2^{-t_{k+1}} + \sum_{\ell > k} \exp(-8^{t_\ell}/2) < 2^{1 - t_{k+1}}, \qquad (6)$$

where the term $2^{-t_{k+1}}$ accounts for the event that $S(\ell, t)$ has at least $2^{t_{k+1}}$ elements, where $\ell$ is the index
of the set $I_\ell$ containing the number $i$ such that $x_t \in B(i)$, if such an $i$ exists. Equation (6) implies the
bound $E_{P_J \in \mathcal{P}}[N(t_k, k)] < t_k \cdot 2^{1 - t_{k+1}}$. By Markov's inequality, the probability that $N(t_k, k) > t_k/2$ is
less than $2^{2 - t_{k+1}}$. By Borel-Cantelli, almost surely the number of $k$ such that $N(t_k, k) > t_k/2$ is finite.
The algorithm's expected regret at time $t_k$ is bounded below by $r_k (t_k - N(t_k, k))$, so with probability 1,
for all but finitely many $k$ we have $R_{(\mathcal{A}, P_J)}(t_k) \ge r_k t_k / 2 \ge (k/2)\, g(t_k)$. This establishes that $\mathcal{A}$ is not
$g(t)$-tractable. □



8 Lipschitz experts in a (very) high dimension 

In this section we discuss the full-feedback Lipschitz experts problem in (very) high dimensional metric
spaces. We posit a new notion of dimensionality which is well-suited to describe regret in such problems,
provide examples of metric spaces for which this notion is relevant (Section 8.1), and analyze the
performance of a simple algorithm in terms of this notion (Section 8.2). Moreover, we consider the same algorithm
under a somewhat restricted version of the problem, and obtain much better regret guarantees via a more
involved analysis (Section 8.3).

Fix a metric space $(X, d)$. For a subset $Y \subseteq X$ and $\delta > 0$, a $\delta$-covering of $Y$ is a collection of sets of
diameter at most $\delta$ whose union contains $Y$. A subset $S \subseteq X$ is a $\delta$-hitting set for $Y$ if $Y \subseteq \cup_{x \in S} B(x, \delta)$.
(So if $S$ is a hitting set for some $\delta$-covering of $Y$ then it is a $\delta$-hitting set for $Y$.)

Let $N_\delta(Y)$ be the minimal size (cardinality) of a $\delta$-covering of $Y$, i.e. the smallest number of sets of
diameter at most $\delta$ sufficient to cover $Y$. The standard definition of the covering dimension is

$$\mathrm{COV}(Y) = \limsup_{\delta \to 0} \frac{\log N_\delta(Y)}{\log(1/\delta)}. \qquad (7)$$

Covering dimension and its refinements have been essential in the study of the Lipschitz MAB problem [30].
However, for the full-feedback Lipschitz experts problem the metrics with bounded covering dimension are
too "easy". We need to consider a much broader class of metrics that satisfy a non-trivial bound on what we
call the log-covering-dimension:

$$\mathrm{LCD}(Y) = \limsup_{\delta \to 0} \frac{\log \log N_\delta(Y)}{\log(1/\delta)}. \qquad (8)$$
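
A quick numerical comparison shows why the extra logarithm matters. The sketch below evaluates the ratio inside the limit for two hypothetical growth rates of the covering number: polynomial growth $N_\delta = \delta^{-d}$ (finite covering dimension $d$) and exponential growth $N_\delta = \exp(\delta^{-b})$. It works with $\log N_\delta$ directly to avoid overflow; the function name is an assumption.

```python
import math


def lcd_ratio(log_covering_number: float, delta: float) -> float:
    """The ratio log log N_delta / log(1/delta) appearing in the definition of LCD."""
    return math.log(log_covering_number) / math.log(1.0 / delta)


d, b = 3.0, 1.5

# Polynomial growth: log N_delta = d * log(1/delta).
poly_coarse = lcd_ratio(d * math.log(1.0 / 1e-6), 1e-6)
poly_fine = lcd_ratio(d * math.log(1.0 / 1e-100), 1e-100)

# Exponential growth: log N_delta = delta^(-b).
exp_ratio = lcd_ratio((1e-3) ** (-b), 1e-3)
```

For polynomial growth the ratio decays to zero as $\delta$ shrinks, so every metric of finite covering dimension has log-covering dimension 0, whereas for $N_\delta = \exp(\delta^{-b})$ the ratio equals $b$ at every scale.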
8.1 Log-covering dimension: some examples

To give an example of a metric space with a non-trivial log-covering dimension, let us consider a uniform 
tree - a rooted tree in which all nodes at the same level have the same number of children. An e-uniform tree 
metric is a metric on the leaves of an infinitely deep uniform tree, in which the distance between two leaves 
is e~\ where i is the level of their least common ancestor. It is easy to see that an e-uniform tree metric such 
that the branching factor at each level i is exp(e _ * b (2 b — 1)) has log-covering dimension b. 

For another example, fix a metric space $(X, d)$ of finite diameter, and let $\mathcal{P}_X$ denote the set of all
probability measures over $X$. Consider $\mathcal{P}_X$ as a metric space $(\mathcal{P}_X, W_1)$ under the Wasserstein $W_1$ metric,
a.k.a. the Earthmover distance (see footnote 8). We claim that the log-covering dimension of this metric space is equal to
the covering dimension of $(X, d)$.

Theorem 8.1. Let $(X, d)$ be a metric space of finite diameter whose covering dimension is $\kappa < \infty$. Let
$(\mathcal{P}_X, W_1)$ be the space of all probability measures over $(X, d)$ under the Wasserstein $W_1$ metric. Then
$\mathrm{LCD}(\mathcal{P}_X, W_1) = \kappa$.

In the remainder of this subsection we prove Theorem 8.1.

Proof (Theorem 8.1 upper bound). Let us cover $(\mathcal{P}_X, W_1)$ with balls of radius $\tfrac{2}{k}$ for some $k \in \mathbb{N}$. Let $S$
be a $\tfrac{1}{k}$-net in $(X, d)$; note that $|S| = O(k^\kappa)$ for a sufficiently large $k$. Let $P$ be the set of all probability
distributions $p$ on $(X, d)$ such that support($p$) $\subseteq S$ and for every point $x \in S$, $p(x)$ is a rational number
with denominator $k^{\kappa+1}$. The cardinality of $P$ is bounded above by $(k^{\kappa+1})^{|S|}$. It remains to show that balls
of radius $\tfrac{2}{k}$ centered at the points of $P$ cover the entire space $(\mathcal{P}_X, W_1)$. This is true because:

• every distribution $q$ is $\tfrac{1}{k}$-close to a distribution $p$ with support contained in $S$ (let $p$ be the distribution
defined by randomly sampling a point of $(X, d)$ from $q$ and then outputting the closest point of $S$);

• every distribution with support contained in $S$ is $\tfrac{1}{k}$-close to a distribution in $P$ (round all probabilities
down to the nearest multiple of $k^{-(\kappa+1)}$; this requires moving only $\tfrac{1}{k}$ units of stuff). □

To prove the lower bound, we make a connection to the Hamming metric. 

Lemma 8.2. Let $(X, d)$ be any metric space, and let $H$ denote the Hamming metric on the Boolean cube
$\{0, 1\}^n$. If $S \subseteq X$ is a subset of even cardinality $2n$, and $\epsilon$ is a lower bound on the distance between any
two points of $S$, then there is a mapping $f : \{0, 1\}^n \to \mathcal{P}_X$ such that for all $a, b \in \{0, 1\}^n$,

$$W_1(f(a), f(b)) \ge \frac{\epsilon}{n} H(a, b). \qquad (9)$$

Proof. Group the points of $S$ arbitrarily into pairs $S_i = \{x_i, y_i\}$, where $i = 1, \ldots, n$. For $a \in \{0, 1\}^n$ and
$1 \le i \le n$, define $t_i(a) = x_i$ if $a_i = 0$, and $t_i(a) = y_i$ otherwise. Let $f(a)$ be the uniform distribution
on the set $\{t_1(a), \ldots, t_n(a)\}$. To prove (9), note that if $i$ is any index such that $a_i \ne b_i$ then $f(a)$ assigns
probability $\tfrac{1}{n}$ to $t_i(a)$ while $f(b)$ assigns zero probability to the entire ball of radius $\epsilon$ centered at $t_i(a)$.
Consequently, the $\tfrac{1}{n}$ units of probability at $t_i(a)$ have to move a distance of at least $\epsilon$ when shifting from
distribution $f(a)$ to $f(b)$. Summing over all indices $i$ such that $a_i \ne b_i$, we obtain (9). □

The following lemma, asserting the existence of asymptotically good binary error-correcting codes, is
well known, e.g. see [18, 36].

8 For a metric space $(X, d)$ of finite diameter, and two probability measures $\mu, \nu$ on $X$, the Wasserstein $W_1$ distance, a.k.a. the
Earthmover distance, is defined as $W_1(\mu, \nu) = \inf E[\,|X - Y|\,]$, where the infimum is taken over all simultaneous distributions of
the random variables $X$ and $Y$ with marginals $\mu$ and $\nu$ respectively. The Wasserstein distance defines a metric space $(\mathcal{P}_X, W_1)$.
It is one of the standard ways to define a distance on probability measures. In particular, it is widely used in Computer Science
literature to compare discrete distributions, e.g. in the context of image retrieval [39].
Lemma 8.3. Suppose $\delta, \rho$ are constants satisfying $0 < \delta < \tfrac{1}{2}$ and $0 < \rho < 1 + \delta \log_2(\delta) + (1 - \delta) \log_2(1 - \delta)$.
For every sufficiently large $n$, the Hamming cube $\{0, 1\}^n$ contains more than $2^{\rho n}$ points, no two of which
are nearer than distance $\delta n$ in the Hamming metric.
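
The lower-bound proof that follows uses this lemma with $\delta = \rho = 1/5$, so it is worth confirming that this pair satisfies the hypothesis. The short check below evaluates the right-hand side of the condition on $\rho$; the function name is an assumption.

```python
import math


def code_rate_bound(delta: float) -> float:
    """Right-hand side of the condition on rho: 1 + d*log2(d) + (1 - d)*log2(1 - d)."""
    return 1 + delta * math.log2(delta) + (1 - delta) * math.log2(1 - delta)


bound_at_one_fifth = code_rate_bound(0.2)  # approximately 0.278
```

Since the bound at $\delta = 1/5$ is about 0.278, the rate $\rho = 1/5$ is admissible, which is what permits a code with at least $2^{n/5}$ codewords at pairwise Hamming distance at least $n/5$.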

Combining these two lemmas, we obtain an easy proof for the lower bound in Theorem 8.1.

Proof (Theorem 8.1, lower bound). Consider any $\gamma < k$. The hypothesis on the covering dimension of
$(X, d)$ implies that for all sufficiently small $\epsilon$, there exists a set $S$ of cardinality $2n$, for some
$n \ge \epsilon^{-\gamma}$, such that the minimum distance between two points of $S$ is at least $5\epsilon$. Now let $C$ be a
subset of $\{0,1\}^n$ having at least $2^{n/5}$ elements, such that the Hamming distance between any two points of $C$ is
at least $n/5$. Lemma 8.3 implies that such a set $C$ exists, and we can then apply Lemma 8.2 to embed $C$ in
$\mathcal{P}_X$, obtaining a subset of $\mathcal{P}_X$ whose cardinality is at least $2^{\epsilon^{-\gamma}/5}$, with distance at least $\epsilon$
between every pair of points in the set. Thus, any $\epsilon$-covering of $\mathcal{P}_X$ must contain at least
$2^{\epsilon^{-\gamma}/5}$ sets, implying that $\mathrm{LCD}(\mathcal{P}_X, W_1) \ge \gamma$. As $\gamma$ was an arbitrary number less than $k$,
the proposition is proved. □
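The proof above invokes Lemma 8.3 with $\delta = \rho = 1/5$. As a quick sanity check on these constants, the
following Python sketch evaluates the bound $1 + \delta\log_2\delta + (1-\delta)\log_2(1-\delta)$ and tests whether a
given pair satisfies the hypothesis of the lemma. The function names `gv_rate` and `hypothesis_holds` are
illustrative and do not come from the paper.

```python
import math


def gv_rate(delta: float) -> float:
    """Return 1 + delta*log2(delta) + (1 - delta)*log2(1 - delta) for 0 < delta < 1/2."""
    if not 0.0 < delta < 0.5:
        raise ValueError("delta must lie in (0, 1/2)")
    return 1.0 + delta * math.log2(delta) + (1.0 - delta) * math.log2(1.0 - delta)


def hypothesis_holds(delta: float, rho: float) -> bool:
    """Check the hypothesis 0 < rho < gv_rate(delta) of Lemma 8.3."""
    return 0.0 < rho < gv_rate(delta)
```

For $\delta = 1/5$ the bound evaluates to roughly 0.278, so $\rho = 1/5$ is admissible, which is what the
proof needs in order to obtain a code of size $2^{n/5}$ with minimum distance $n/5$.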

8.2 Using the new definition 

To see how the log-covering dimension is relevant to the Lipschitz experts problem, consider the following
simple algorithm, called NaiveExperts(b).⁹ The algorithm is parameterized by $b > 0$. It runs in phases.
Each phase $i$ lasts for $T = 2^i$ rounds, and outputs its best guess $x_i^* \in X$, which is played throughout phase
$i + 1$. During phase $i$, the algorithm picks a $\delta$-hitting set $S$ for $X$ of size at most $N_\delta(X)$, for
$\delta = T^{-1/(b+2)}$. By the end of the phase, $x_i^*$ is defined as the point in $S$ with the highest sample average
(breaking ties arbitrarily). This completes the description of the algorithm.
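The phase structure described above can be sketched in Python as follows. This is a minimal illustration
under stated assumptions, not the authors' implementation: the metric space is taken to be $[0, 1]$ with
$d(x, y) = |x - y|$, the $\delta$-hitting set is a uniform grid, and `example_payoff` is a hypothetical problem
instance whose noise is shared by all points. All function names are assumptions made for the example.

```python
import random


def hitting_set(delta):
    """Return a delta-hitting set for [0, 1]: cell midpoints of a uniform grid."""
    m = int(1.0 / (2.0 * delta)) + 1          # m cells, each of width 1/m < 2*delta
    return [(2 * k + 1) / (2.0 * m) for k in range(m)]


def naive_experts(sample_payoff_fn, b, num_phases, rng):
    """Run phases 1..num_phases and return the best guesses x_1^*, x_2^*, ...

    sample_payoff_fn(rng) must return one payoff function f: [0, 1] -> [0, 1]
    (full feedback: the whole function is revealed in each round).
    """
    guesses = []
    for i in range(1, num_phases + 1):
        T = 2 ** i                              # phase duration
        delta = T ** (-1.0 / (b + 2.0))
        S = hitting_set(delta)
        totals = [0.0] * len(S)
        for _ in range(T):
            f = sample_payoff_fn(rng)
            for k, x in enumerate(S):
                totals[k] += f(x)
        # Point of S with the highest sample average (ties go to the first maximizer).
        best = max(range(len(S)), key=lambda k: totals[k])
        guesses.append(S[best])                 # would be played throughout phase i + 1
    return guesses


def example_payoff(rng):
    """Hypothetical instance: mean payoff 0.8 - 0.5*|x - 0.3| plus noise shared by all x."""
    z = rng.uniform(-0.2, 0.2)
    return lambda x: 0.8 - 0.5 * abs(x - 0.3) + z
```

Because the feedback is full, the point played during a phase does not affect what the algorithm observes,
so the sketch only tracks the sequence of best guesses. A call such as
`naive_experts(example_payoff, b=1.0, num_phases=8, rng=random.Random(0))` returns one guess per phase.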

It is easy to see that the regret of NaiveExperts is naturally described in terms of the log-covering
dimension. The proof is based on the argument from [27]. We restate it here for the sake of completeness, and
to explain how the new dimensionality notion is used.

Theorem 8.4. Consider the full-feedback Lipschitz experts problem on a metric space $(X, d)$. For each
$b > \mathrm{LCD}(X)$, algorithm NaiveExperts(b) achieves regret $R(t) = O(t^{1-1/(b+2)})$.

Proof. Let $N_\delta = N_\delta(X)$, and let $\mu$ be the expected payoff function. Consider a given phase $i$ of the
algorithm. Let $T = 2^i$ be the phase duration. Let $\delta = T^{-1/(b+2)}$, and let $S \subset X$ be the $\delta$-hitting set chosen in
this phase. Note that for any sufficiently large $T$ it is the case that $N_\delta \le 2^{\delta^{-b}}$. For each $x \in S$, let
$\mu_T(x)$ be the sample average of the feedback from $x$ by the end of the phase. Then by Chernoff bounds,

    $\Pr[\,|\mu_T(x) - \mu(x)| \le r_T\,] > 1 - (T N_\delta)^{-3}$, where $r_T = \sqrt{8 \log(T N_\delta)/T} < 2\delta$.    (10)

Note that $\delta$ is chosen specifically to ensure that $r_T \le O(\delta)$.

We can neglect the regret incurred when the event in (10) does not hold for some $x \in S$. From now
on, let us assume that the event in (10) holds for all $x \in S$. Let $x^*$ be an optimal strategy, and let
$x_i^* = \mathrm{argmax}_{x \in S}\, \mu_T(x)$ be the "best guess". Let $x \in S$ be a point that covers $x^*$. Then

    $\mu(x_i^*) \ge \mu_T(x_i^*) - 2\delta \ge \mu_T(x) - 2\delta \ge \mu(x) - 4\delta \ge \mu(x^*) - 5\delta$.

Thus the total regret $R_{i+1}$ accumulated in phase $i + 1$ is

    $R_{i+1} \le 2^{i+1} (\mu(x^*) - \mu(x_i^*)) \le O(\delta T) = O(T^{1-1/(2+b)})$.

Thus the total regret summed over phases is as claimed. □

9 This algorithm is a version of the "naive algorithm" [27, 30] for the Lipschitz MAB problem. A similar algorithm has been
used by [23] to obtain regret $R(T) = O(\sqrt{T})$ for metric spaces of finite covering dimension.

8.3 The uniformly Lipschitz experts problem

We now turn our attention to the uniformly Lipschitz experts problem, a restricted version of the full-feedback
Lipschitz experts problem in which a problem instance $(X, d, P)$ satisfies a further property that each
function $f \in \mathrm{support}(P)$ is itself a Lipschitz function on $(X, d)$. We show that for this version, NaiveExperts
obtains a significantly better regret guarantee, via a more involved analysis. As we will see in the next
section, for a wide class of metric spaces including $\epsilon$-uniform tree metrics there is a matching lower bound.

Theorem 8.5. Consider the uniformly Lipschitz experts problem with full feedback. Fix a metric space
$(X, d)$. For each $b > \mathrm{LCD}(X)$ such that $b \ge 2$, NaiveExperts(b − 2) achieves regret $R(t) = O(t^{1-1/b})$.



Proof. The preliminaries are similar to those in the proof of Theorem 8.4. For simplicity, assume $b > 2$.
Let $N_\delta = N_\delta(X)$, and let $\mu$ be the expected payoff function. Consider a given phase $i$ of the algorithm.
Let $T = 2^i$ be the phase duration. Let $\delta = T^{-1/b}$, and let $S$ be the $\delta$-hitting set chosen in this phase. (The
specific choice of $\delta$ is the only difference between the algorithm here and the algorithm in Theorem 8.4.)
Note that $|S| \le N_\delta$, and for any sufficiently large $T$ it is the case that $N_\delta \le 2^{\delta^{-b}}$.

The rest of the analysis holds for any set $S$ such that $|S| \le N_\delta$. (That is, it is not essential that $S$ is a
$\delta$-hitting set for $X$.) For each $x \in S$, let $\nu(x)$ be the sample average of the feedback from $x$ by the end of the
phase. Let $y_i^* = \mathrm{argmax}(\mu, S)$ be the optimal strategy in the chosen sample, and let $x_i^* = \mathrm{argmax}(\nu, S)$ be
the algorithm's "best guess". The crux is to show that

    $\Pr[\,\mu(y_i^*) - \mu(x_i^*) \le O(\delta \log T)\,] > 1 - T^{-2}$.    (11)

Once (11) is established, the remaining steps are exactly as in the proof of Theorem 8.4.

Proving (11) requires a new technique. The obvious approach, to use Chernoff Bounds for each $x \in S$
separately and then take a Union Bound, does not work, essentially because one needs to take the Union
Bound over too many points. Instead, we will use a more efficient tail bound: for each $x, y \in X$,
we will use Chernoff Bounds applied to the random variable $f(x) - f(y)$, where $f \sim P$ and $(X, d, P)$ is the
problem instance. For a more convenient notation, we define

    $\Delta(x, y) = [\mu(x) - \mu(y)] + [\nu(y) - \nu(x)]$.

Then for any $N \in \mathbb{N}$ we have

    $\Pr\left[\,|\Delta(x, y)| \le d(x, y) \sqrt{8 \log(TN)/T}\,\right] > 1 - (TN)^{-3}$.    (12)

The point is that the "slack" in the Chernoff Bound is scaled by the factor of $d(x, y)$. This is because each
$f \in \mathrm{support}(P)$ is a Lipschitz function on $(X, d)$.

In order to take advantage of (12), let us define the following structure that we call the covering tree of
the metric space $(X, d)$. This structure consists of a rooted tree $\mathcal{T}$ and non-empty subsets $X(u) \subset X$ for
each internal node $u$. Let $V_{\mathcal{T}}$ be the set of all internal nodes. Let $\mathcal{T}_j$ be the set of all level-$j$ internal nodes
(so that $\mathcal{T}_0$ is a singleton set containing the root). For each $u \in V_{\mathcal{T}}$, let $C(u)$ be the set of all children of $u$.
For each node $u \in \mathcal{T}_j$ the structure satisfies the following two properties: (i) set $X(u)$ has diameter at most
$2^{-j}$, (ii) the sets $X(v)$, $v \in C(u)$ form a partition of $X(u)$. This completes the definition.

By definition of the covering number $N_\delta(\cdot)$ there exists a covering tree $\mathcal{T}$ in which each node $u \in \mathcal{T}_j$ has
fan-out $N_{2^{-j}}(X(u))$. Fix one such covering tree. For each node $u \in V_{\mathcal{T}}$, define

    $\sigma(u) = \mathrm{argmax}(\mu, X(u) \cap S)$,
    $\rho(u) = \mathrm{argmax}(\nu, X(u) \cap S)$,    (13)

where the tie-breaking rule is the same as in the algorithm.

Let $n = \lceil \log(1/\delta) \rceil$. Let us say that phase $i$ is clean if the following two properties hold:

(i) for each node $u \in V_{\mathcal{T}}$ and any two children $v, w \in C(u)$ we have $|\Delta(\sigma(v), \sigma(w))| \le 4\delta$.

(ii) for any $x, y \in S$ such that $d(x, y) \le \delta$ we have $|\Delta(x, y)| \le 4\delta$.

Claim 8.6. For any sufficiently large $i$, phase $i$ is clean with probability at least $1 - T^{-2}$.

Proof. To prove (i), let $j$ be such that $u \in \mathcal{T}_j$. We consider each $j$ separately. Note that (i) is trivial for
$j > n$. Now fix $j \le n$ and apply the Chernoff-style bound (12) with $N = |\mathcal{T}_j|$ and $(x, y) = (\sigma(v), \sigma(w))$.
Since each level-$l$ node has fan-out at most $2^{2^{lb}}$ for each sufficiently large $l$, it follows that

    $\log |\mathcal{T}_j| \le C + \sum_{l=1}^{j} 2^{lb} \le C + \tfrac{4}{3}\, 2^{jb}$,

where $C$ is a constant that depends only on the metric space and $b$. It is easy to check that for any sufficiently
large phase $i$ (which, in turn, determines $T$, $\delta$ and $n$), the "slack" in (12) is at most $4\delta$:

    $d(x, y) \sqrt{8 \log(TN)/T} \le 3\, d(x, y) \sqrt{\log(N)/T} \le 4 \cdot 2^{-j} \sqrt{2^{bj}/2^{bn}} = 4\delta\, 2^{-(n-j)(b-2)/2} \le 4\delta$.

Interestingly, the right-most inequality above is the only place in the proof where it is essential that $b \ge 2$.
To prove (ii), apply (12) with $N = |S|$ similarly. Claim proved. □

From now on we will consider a clean phase. (We can ignore regret incurred in the event that the phase
is not clean.) We focus on the quantity $\Delta^*(u) = \Delta(\sigma(u), \rho(u))$. Note that by definition $\Delta^*(u) \ge 0$. The
central argument of this proof is the following upper bound on $\Delta^*(u)$.

Claim 8.7. In a clean phase, $\Delta^*(u) \le O(\delta)(n - j)$ for each $j \le n$ and each $u \in \mathcal{T}_j$.

Proof. Use induction on $j$. The base case $j = n$ follows by part (ii) of the definition of the clean phase,
since for $u \in \mathcal{T}_n$ both $\sigma(u)$ and $\rho(u)$ lie in $X(u)$, a set of diameter at most $\delta$. For the induction step,
assume the claim holds for each $v \in \mathcal{T}_{j+1}$, and let us prove it for some fixed $u \in \mathcal{T}_j$.

Pick children $v, w \in C(u)$ such that $\sigma(u) \in X(v)$ and $\rho(u) \in X(w)$. Since the tie-breaking rule in (13)
is fixed for all nodes in the covering tree, it follows that $\sigma(u) = \sigma(v)$ and $\rho(u) = \rho(w)$. Then

    $\Delta^*(w) + \Delta(\sigma(v), \sigma(w)) = \Delta(\sigma(w), \rho(u)) + \Delta(\sigma(u), \sigma(w))$
        $= \mu(\sigma(w)) - \mu(\rho(u)) + \nu(\rho(u)) - \nu(\sigma(w))$
          $+ \mu(\sigma(u)) - \mu(\sigma(w)) + \nu(\sigma(w)) - \nu(\sigma(u))$
        $= \Delta^*(u)$.

The claim follows since $\Delta^*(w) \le O(\delta)(n - j - 1)$ by induction, and $\Delta(\sigma(v), \sigma(w)) \le 4\delta$ by part (i) in the
definition of the clean phase. □
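The induction step rests on the additivity $\Delta(x, y) + \Delta(y, z) = \Delta(x, z)$, which is what makes the terms
involving $\sigma(w)$ cancel. The following Python sketch, with hypothetical payoff tables `mu` and `nu` indexed by
points, lets one check this identity and the non-negativity of $\Delta^*$ numerically. The helper names are
illustrative and not from the paper.

```python
def make_delta(mu, nu):
    """Return the function Delta(x, y) = [mu(x) - mu(y)] + [nu(y) - nu(x)]."""
    def delta(x, y):
        return (mu[x] - mu[y]) + (nu[y] - nu[x])
    return delta


def argmax(table, points):
    """Return a point of `points` maximizing table[x] (first maximizer wins ties)."""
    return max(points, key=lambda x: table[x])


def delta_star(mu, nu, points):
    """Delta^* for a node whose set X(u) ∩ S is `points`: Delta(sigma, rho)."""
    sigma = argmax(mu, points)   # maximizer of the expected payoff
    rho = argmax(nu, points)     # maximizer of the sample average
    return make_delta(mu, nu)(sigma, rho)
```

Since $\sigma$ maximizes `mu` and $\rho$ maximizes `nu`, both bracketed differences in $\Delta(\sigma, \rho)$ are non-negative,
which is the observation $\Delta^*(u) \ge 0$ used in the text.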

To complete the proof of (11), let $u_0$ be the root of the covering tree. Then $y_i^* = \sigma(u_0)$ and $x_i^* = \rho(u_0)$.
Therefore by Claim 8.7 (applied for $\mathcal{T}_0 = \{u_0\}$) we have

    $O(\delta n) \ge \Delta^*(u_0) = \Delta(y_i^*, x_i^*) \ge \mu(y_i^*) - \mu(x_i^*)$. □

9 Lipschitz experts in a (very) high dimension: regret characterization 

As it turns out, the log-covering dimension (8) is not the right notion to characterize regret for arbitrary
metric spaces. We need a more refined version, similar to the max-min-covering dimension from [30]:

    $\mathrm{MaxMinLCD}(X) = \sup_{Y \subset X} \inf\{\mathrm{LCD}(Z) : \text{open non-empty } Z \subset Y\}$.    (14)

Note that in general $\mathrm{MaxMinLCD}(X) \le \mathrm{LCD}(X)$. The equality holds for "homogeneous" metric spaces such as
$\epsilon$-uniform tree metrics. We prove that Theorem 1.6 holds with $b = \mathrm{MaxMinLCD}(X)$.

Theorem 9.1. Fix a metric space $(X, d)$ and let $b = \mathrm{MaxMinLCD}(X)$. The full-feedback Lipschitz experts
problem on $(X, d)$ is $(t^\gamma)$-tractable for any $\gamma > \frac{b+1}{b+2}$, and not $(t^\gamma)$-tractable for any $\gamma < \frac{b-1}{b}$.

For the lower bound, we use a suitably "thick" version of the ball-tree from Section 3 in conjunction
with the $(\epsilon, \delta, k)$-ensemble idea from Section 3; see Section 9.1. For the algorithmic result, we combine
the "naive" experts algorithm (NaiveExperts) with (an extension of) the transfinite fat decomposition
technique from [30]; see Section 9.2.

The lower bound in Theorem 9.1 holds for the uniformly Lipschitz experts problem. It follows that the
upper bound in Theorem 8.5 is optimal for metric spaces such that $\mathrm{MaxMinLCD}(X) = \mathrm{LCD}(X)$, e.g. for
$\epsilon$-uniform tree metrics. In fact, we can plug the improved analysis of NaiveExperts from Theorem 8.5 into
the algorithmic technique from Theorem 9.1 and obtain a matching upper bound in terms of the MaxMinLCD.
Thus (in conjunction with Theorem 1.4) we have a complete characterization for regret:

Theorem 9.2. Consider the uniformly Lipschitz experts problem with full feedback. Fix a metric space
$(X, d)$ with uncountably many points, and let $b = \mathrm{MaxMinLCD}(X)$. The problem on $(X, d)$ is $(t^\gamma)$-tractable
for any $\gamma > \max(\frac{b-1}{b}, \frac{1}{2})$, and not $(t^\gamma)$-tractable for any $\gamma < \max(\frac{b-1}{b}, \frac{1}{2})$.

The proof of the upper bound in Theorem 9.2 proceeds exactly as that in Theorem 9.1, except that we use
a more efficient analysis of NaiveExperts.

9.1 The MaxMinLCD lower bound: proof for Theorem 9.2

If $\mathrm{MaxMinLCD}(X) = d$ and $\gamma < \frac{d-1}{d}$, let us first choose $b < c < d$ such that $\gamma < \frac{b-1}{b}$. Let $Y \subset X$ be
a subspace such that $c \le \inf\{\mathrm{LCD}(Z) : \text{open, nonempty } Z \subset Y\}$. We will repeatedly use the following
packing lemma that relies on the fact that $b < \mathrm{LCD}(U)$ for all nonempty open $U \subset Y$.

Lemma 9.3. For any nonempty open $U \subset Y$ and any $r_0 > 0$ there exists $r \in (0, r_0)$ such that $U$ contains
more than $2^{r^{-b}}$ disjoint balls of radius $r$.

Proof. Let $r < r_0$ be a positive number such that every covering of $U$ requires more than $2^{r^{-b}}$ balls of radius
$2r$. Such an $r$ exists, because $\mathrm{LCD}(U) > b$. Now let $\mathcal{P} = \{B_1, B_2, \ldots, B_M\}$ be any maximal collection of
disjoint $r$-balls. For every $y \in U$ there must exist some ball $B_i$ ($1 \le i \le M$) whose center is within distance
$2r$ of $y$, as otherwise $B(y, r)$ would be disjoint from every element of $\mathcal{P}$, contradicting the maximality of
that collection. If we enlarge each ball $B_i$ to a ball $B_i^+$ of radius $2r$, then every $y \in U$ is contained in one of
the balls $\{B_i^+ \mid 1 \le i \le M\}$, i.e. they form a covering of $U$. Hence $M > 2^{r^{-b}}$ as desired. □
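The packing-to-covering step in this proof can be illustrated on a finite point set. The Python sketch
below is a simplified stand-in, not the paper's construction: it greedily selects centers that are pairwise
at distance at least $2r$ (a sufficient condition for the open $r$-balls to be disjoint, by the triangle
inequality) and then checks that the enlarged $2r$-balls cover all points. The function names and the use of
a finite point list are assumptions made for the example.

```python
import math


def greedy_separated_centers(points, r, dist):
    """Greedily pick centers pairwise at distance >= 2r, so their open r-balls are disjoint.

    The result is maximal: no remaining point can be added without violating separation.
    """
    centers = []
    for p in points:
        if all(dist(p, c) >= 2 * r for c in centers):
            centers.append(p)
    return centers


def is_covering(points, centers, radius, dist):
    """Check that every point lies in the open ball of the given radius around some center."""
    return all(any(dist(p, c) < radius for c in centers) for p in points)


def grid_points(k):
    """Hypothetical test space: a (k+1) x (k+1) grid in the unit square."""
    return [(i / k, j / k) for i in range(k + 1) for j in range(k + 1)]
```

For example, with `pts = grid_points(20)` and `centers = greedy_separated_centers(pts, 0.1, math.dist)`,
the call `is_covering(pts, centers, 0.2, math.dist)` succeeds, mirroring the argument that a maximal packing
by $r$-balls yields a covering by $2r$-balls.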

Using the packing lemma we recursively construct an infinite sequence of sets $\mathcal{B}_0, \mathcal{B}_1, \ldots$, each consisting
of finitely many disjoint open balls of equal radius $r_i$ in $Y$. Let $\mathcal{B}_0 = \{Y\}$ and let $r_0 = 1/4$. If $i > 0$, let
$r_i < r_{i-1}/4$ be a positive number small enough that for every ball $B = B(x, r_{i-1}) \in \mathcal{B}_{i-1}$, the sub-ball
$B(x, r_{i-1}/2)$ contains $n_i = \lceil 2^{r_i^{-b}} \rceil$ disjoint balls of radius $r_i$. Let $\mathcal{B}_i(B)$ denote this collection of disjoint
balls and let $\mathcal{B}_i = \bigcup_{B \in \mathcal{B}_{i-1}} \mathcal{B}_i(B)$. For each ball $B = B(x, r) \in \mathcal{B}_i$, define a "bump function" supported in
$B$ by

    $G_B(y) = \min\{r - d(x, y),\ r/2\}$ if $y \in B$, and $G_B(y) = 0$ otherwise.

Let $\mathcal{B}_\infty = \bigcup_{i=0}^{\infty} \mathcal{B}_i$. Note that $\mathcal{B}_\infty$ is analogous to the ball-tree defined in Section 3.
For a mapping $\sigma : \mathcal{B}_\infty \to \{\pm 1\}$ define a function $\pi_\sigma : X \to [0, 1]$ by

    $\pi_\sigma(x) = \frac{1}{2} + \sum_{B \in \mathcal{B}_\infty} \sigma(B)\, G_B(x)$.    (15)

The sum converges absolutely because every $x \in X$ belongs to at most one ball in $\mathcal{B}_i$ for each $i$, so the absolute
value of the infinite sum in (15) is bounded above by $\sum_{i=0}^{\infty} r_i < 1/3$. Moreover, one can verify that our
construction ensures that $\pi_\sigma(x)$ is a Lipschitz function of $x$ with Lipschitz constant 1.

If $Q$ is any subset of $\mathcal{B}_\infty$, one can define a problem instance of the full-feedback Lipschitz experts
problem: a distribution $P_Q$ on payoff functions $\pi_\sigma$ obtained by sampling $\sigma(B) \in \{\pm 1\}$ uniformly at random for
$B \notin Q$ and performing biased sampling of $\sigma(B) \in \{\pm 1\}$ with $\mathbb{E}[\sigma(B)] = 1/3$ when $B \in Q$, and then
defining $\pi_\sigma$ using (15). Note that the distribution $P_Q$ has expected payoff function $\mu = \frac{1}{2} + \frac{1}{3} \sum_{B \in Q} G_B$.
In proving the lower bound, we will consider the distribution $\mathcal{P}$ on Lipschitz experts problem instances
$P_Q$ where $Q$ is a random subset of $\mathcal{B}_\infty$ obtained by sampling one ball $B_0 \in \mathcal{B}_0$ uniformly at random, and
also sampling one element $Q(B) \in \mathcal{B}_i(B)$ uniformly at random and independently for each $B \in \mathcal{B}_{i-1}$.
By analogy with the notion of complete lineage defined in Section 3, we will refer to any such set $Q$ as a
complete lineage in $\mathcal{B}_\infty$. Given a complete lineage $Q$, we can define an infinite nested sequence of balls
$B_0 \supset B_1 \supset \cdots$ by specifying that $B_{i+1} = Q(B_i)$ for each $i$.

If $\mu$ is the expectation of a random payoff function $\pi_\sigma$ sampled from $P_Q$, then $\mu$ achieves its maximum
value $\frac{1}{2} + \frac{1}{6} \sum_{i=0}^{\infty} r_i$ at the unique point $x^* \in \bigcap_{i=0}^{\infty} B_i$. At any point $x \notin B_j$, we have

fi(x*) - fi(x) > E.= 3 r i)-(i Er=j+i r ») = \ r i- 

We now finish the lower bound proof as in the proof of Lemma 3.4. For each complete lineage $Q$ and
ball $B \in \mathcal{B}_{i-1}$, let $B^1, B^2, \ldots, B^{n_i}$ be the elements of $\mathcal{B}_i(B)$. Consider the sets $Q_0 = Q \setminus \{Q(B)\}$ and

Qj = Q U {B^} for j = 1,2, ...,rii. The distributions (p Qo , P Ql , . . . , F Qn ^ ^ constitute an (e, 5, k)- 

ensemble for e = ^r^, 5 = ^, and k = rii. Consequently, for tj = r^ b , the inequality t{ < ln(17fc)/2(5 2 
holds, and we obtain a lower bound of 

    $R_{(\mathcal{A}, P_{Q_j})}(t_i) \ge \epsilon t_i / 2 = \Omega(r_i^{1-b}) = \Omega(t_i^{(b-1)/b})$

for at least half of the distributions $P_{Q_j}$ in the ensemble. Recalling that $\gamma < \frac{b-1}{b}$, we see that the problem is
not $t^\gamma$-tractable.

9.2 The MaxMinLCD upper bound: proofs for Theorem 9.1 and Theorem 9.2

First, let us incorporate the analysis from Section 8 via the following lemma.

Lemma 9.4. Consider an instance $(X, d, P)$ of the full-feedback Lipschitz experts problem, and let $x^* \in X$
be an optimal point. Fix a subset $U \subset X$ which contains $x^*$, and let $b > \mathrm{LCD}(U)$. Then for any sufficiently
large $T$ and $\delta = T^{-1/(b+2)}$ the following holds:

(a) Let $S$ be a $\delta$-hitting set for $U$ of cardinality $|S| \le N_\delta(U)$. Consider the feedback of all points in $S$
over $T$ rounds; let $x$ be the point in $S$ with the largest sample average (break ties arbitrarily). Then

    $\Pr[\,\mu(x^*) - \mu(x) \le O(\delta \log T)\,] > 1 - T^{-2}$.

(b) For a uniformly Lipschitz experts problem and $b > 2$, property (a) holds for $\delta = T^{-1/b}$.

Transfinite LCD decomposition. We redefine the transfinite fat decomposition from [30] with respect to
the log-covering dimension rather than the covering dimension.

Definition 9.5. Fix a metric space $(X, d)$. Let $\beta$ denote an arbitrary ordinal. A transfinite LCD decomposition
of depth $\beta$ and dimension $b$ is a transfinite sequence $\{S_\lambda\}_{0 \le \lambda \le \beta}$ of closed subsets of $X$ such that:

(a) $S_0 = X$, $S_\beta = \emptyset$, and $S_\nu \supseteq S_\lambda$ whenever $\nu < \lambda$.

(b) if $V \subset X$ is closed, then the set $\{\text{ordinals } \nu \le \beta : V \text{ intersects } S_\nu\}$ has a maximum element.

(c) for any ordinal $\lambda \le \beta$ and any open set $U \subset X$ containing $S_{\lambda+1}$ we have $\mathrm{LCD}(S_\lambda \setminus U) \le b$.

The existence of suitable decompositions and the connection to MaxMinLCD is derived exactly as in
Proposition 3.15 in [30].

Lemma 9.6. For every compact metric space (X, d), MaxMinLCD(X) is equal to the infimum of all b such 
that X has a transfinite LCD decomposition of dimension b. 

In what follows, let us fix a metric space $(X, d)$ and $b > \mathrm{MaxMinLCD}(X)$, and let $\{S_\lambda\}_{0 \le \lambda \le \beta}$ be a
transfinite LCD decomposition of depth $\beta$ and dimension $b$. For each $x \in X$, let the depth of $x$ be the maximal
ordinal $\lambda$ such that $x \in S_\lambda$. (Such an ordinal exists by Definition 9.5(b).)

Access to the metric space. The algorithm requires two oracles: the depth oracle Depth(·) and the
covering oracle Cover(·). Both oracles input a finite collection $\mathcal{F}$ of open balls $B_0, B_1, \ldots, B_n$, given via the
centers and the radii, and return a point in $X$. Let $B$ be the union of these balls, and let $\bar{B}$ be the closure of
$B$. A call to oracle Depth($\mathcal{F}$) returns an arbitrary point $x \in \bar{B} \cap S_\lambda$, where $\lambda$ is the maximum ordinal such
that $S_\lambda$ intersects $\bar{B}$. (Such an ordinal exists by Definition 9.5(b).) Given a point $y^* \in X$ of depth $\lambda$, a call
to oracle Cover($y^*$, $\mathcal{F}$) either reports that $B$ covers $S_\lambda$, or it returns an arbitrary point $x \in S_\lambda \setminus B$. A call to
Cover($\emptyset$, $\mathcal{F}$) is equivalent to the call Cover($y^*$, $\mathcal{F}$) for some $y^* \in S_0$.

The covering oracle will be used to construct $\delta$-nets as follows. First, using successive calls to Cover($\emptyset$, $\mathcal{F}$)
one can construct a $\delta$-net for $X$. Second, given a point $y^* \in X$ of depth $\lambda$ and a collection of open balls
whose union is $B$, using successive calls to Cover($y^*$, ·) one can construct a $\delta$-net for $S_\lambda \setminus B$. The second
usage is geared towards the scenario when $S_{\lambda+1} \subset B$ and for some optimal strategy $x^*$ we have $x^* \in S_\lambda \setminus B$.
Then by Definition 9.5(c) we have $\mathrm{LCD}(S_\lambda \setminus B) \le b$, and one can apply Lemma 9.4.
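The first usage of the covering oracle, building a $\delta$-net by successive calls, can be sketched as follows.
This Python fragment is an illustration under simplifying assumptions: the space is a finite, explicitly
listed set, and the oracle is a simulated stand-in that ignores the depth argument $y^*$ (so it plays the role
of Cover($\emptyset$, ·) only). The names `make_cover_oracle` and `build_net` are hypothetical.

```python
def make_cover_oracle(space, dist):
    """Simulate the covering oracle on a finite, explicitly listed space."""
    def cover(balls):
        # balls: list of (center, radius) pairs describing open balls
        for p in space:
            if all(dist(p, c) >= rad for c, rad in balls):
                return p          # a point not covered by the union of the balls
        return None               # the balls cover the whole space
    return cover


def build_net(cover, delta):
    """Construct a delta-net by successive oracle calls.

    Each returned point becomes the center of a new open delta-ball; the loop
    stops once the oracle reports that the balls cover the space.
    """
    balls = []
    while True:
        p = cover(balls)
        if p is None:
            return [c for c, _ in balls]
        balls.append((p, delta))
```

On the integer points $0, 1, \ldots, 100$ with $d(x, y) = |x - y|$ and $\delta = 10$, this procedure selects the
points $0, 10, \ldots, 100$: they are pairwise at distance at least $\delta$, and every point of the space lies within
distance less than $\delta$ of one of them.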

The algorithm. Our algorithm proceeds in phases $i = 1, 2, 3, \ldots$ of $2^i$ rounds each. Each phase $i$ outputs
two strategies $x_i^*, y_i^* \in X$ that we call the best guess and the depth estimate. Throughout phase $i$, the
algorithm plays the best guess $x_{i-1}^*$ from the previous phase. The depth estimate $y_{i-1}^*$ is used "as if" its
depth is equal to the depth of some optimal strategy. (We show that for a large enough $i$ this is indeed the
case with a very high probability.)

At the end of the phase, the algorithm selects a finite set $A_i \subset X$ of active points, as described below.
Once this set is chosen, $x_i^*$ is defined simply as a point in $A_i$ with the largest sample average of the feedback
(breaking ties arbitrarily). It remains to define $y_i^*$ and $A_i$ itself.

Let T = 2 l be the phase duration. Using the covering oracle, the algorithm constructs (roughly) the finest 
r-net containing at most points. Specifically, the algorithm constructs 2~- ? -nets A/}, for j = 0, 1, 2, . . ., 
until it finds the largest j such that Mj contains at most 2^ points. Let r = 2 -jf and J\f = J\fj. 

For each $x \in \mathcal{N}$ let $\mu_T(x)$ be the sample average of the feedback during this phase. Let

    $\Delta_T(x) = \mu_T^* - \mu_T(x)$, where $\mu_T^* = \max(\mu_T, \mathcal{N})$.

Define the depth estimate $y_i^*$ to be the output of the oracle call Depth($\mathcal{F}$), where

    $\mathcal{F} = \{B(x, r) : x \in \mathcal{N} \text{ and } \Delta_T(x) \le r\}$.

Finally, let us specify $A_i$. Let $B$ be the union of balls

    $\{B(x, r) : x \in \mathcal{N} \text{ and } \Delta_T(x) > 2(r_T + r)\}$,    (16)

where $r_T = \sqrt{8 \log(T |\mathcal{N}|)/T}$ is chosen so that by Chernoff Bounds we have

    $\Pr[\,|\mu_T(x) - \mu(x)| \le r_T\,] > 1 - (T |\mathcal{N}|)^{-3}$ for each $x \in \mathcal{N}$.    (17)

Let $\delta = T^{-1/b}$ for the uniformly Lipschitz experts problem, and $\delta = T^{-1/(b+2)}$ otherwise. Let $Q_T = 2^{\delta^{-b}}$
be the quota on the number of active points. Given a point $y_{i-1}^*$ whose depth is (say) $\lambda$, the algorithm uses the
covering oracle to construct a $\delta$-net $\mathcal{N}'$ for $S_\lambda \setminus B$. Define $A_i$ as $\mathcal{N}'$ or an arbitrary $Q_T$-point subset thereof,
whichever is smaller.

(Very high-level) sketch of the analysis. The proof roughly follows that of Theorem 3.16 in [30]. Call a
phase clean if the event in (17) holds for all $x \in \mathcal{N}$ and the appropriate version of this event holds for all
$x \in A_i$. (The regret from phases which are not clean is negligible.) On a very high level, the proof consists
of two steps. First we show that for a sufficiently large $i$, if phase $i$ is clean then the depth estimate $y_i^*$ is
correct, in the sense that its depth is indeed equal to the depth of some optimal strategy. The argument is similar to
the one in Lemma 4.6. Second, we show that for a sufficiently large $i$, if the depth estimate $y_{i-1}^*$ is "correct"
(i.e. its depth is equal to that of some optimal strategy), and phase $i$ is clean, then the "best guess" $x_i^*$ is
good, namely $\mu(x_i^*)$ is within $O(\delta \log T)$ of the optimum. The reason is that, letting $\lambda$ be the depth of $y_{i-1}^*$,
one can show that for a sufficiently large $T$ the set $B$ (defined in (16)) contains $S_{\lambda+1}$ and does not contain
some optimal strategy. By definition of the transfinite LCD decomposition we have $\mathrm{LCD}(S_\lambda \setminus B) \le b$, so in
our construction the quota $Q_T$ on the number of active points permits $A_i$ to be a $\delta$-cover of $S_\lambda \setminus B$. Now
we can use Lemma 9.4 to guarantee the "quality" of $x_i^*$. The final regret computation is similar to the one
in the proof of Theorem 8.4.

Appendix A: Reduction to compact metric spaces

In this section we reduce the Lipschitz MAB problem to that on complete metric spaces. 

Lemma A.1. The Lipschitz MAB problem on a metric space $(X, d)$ is $f(t)$-tractable if and only if it is
$f(t)$-tractable on the completion of $(X, d)$. Likewise for the Lipschitz experts problem with double feedback.

Proof. Let $(X, d)$ be a metric space with completion $(Y, d)$. Since $Y$ contains an isometric copy of $X$, we
will abuse notation and consider $X$ as a subset of $Y$. We will present the proof for the Lipschitz MAB problem;
for the experts problem with double feedback, the proof is similar.

Given an algorithm $\mathcal{A}_X$ which is $f(t)$-tractable for $(X, d)$, we may use it as a Lipschitz MAB algorithm
for $(Y, d)$ as well. (The algorithm has the property that it never selects a point of $Y \setminus X$, but this doesn't
prevent us from using it when the metric space is $(Y, d)$.) The fact that $X$ is dense in $Y$ implies that for
every Lipschitz payoff function $\mu$ defined on $Y$, we have $\sup(\mu, X) = \sup(\mu, Y)$. From this, it follows
immediately that the regret of $\mathcal{A}_X$, when considered as a Lipschitz MAB algorithm for $(X, d)$, is the same as
its regret when considered as a Lipschitz MAB algorithm for $(Y, d)$.

Conversely, given an algorithm $\mathcal{A}_Y$ which is $f(t)$-tractable for $(Y, d)$, we may design a Lipschitz MAB
algorithm $\mathcal{A}_X$ for $(X, d)$ by running $\mathcal{A}_Y$ and perturbing its output slightly. Specifically, for each point
$y \in Y$ and each $t \in \mathbb{N}$ we fix $x = x(y, t) \in X$ such that $d(x, y) < 2^{-t}$. If $\mathcal{A}_Y$ recommends playing strategy
$y_t \in Y$ at time $t$, algorithm $\mathcal{A}_X$ instead plays $x = x(y_t, t)$. Let $\pi$ be the observed payoff. Algorithm $\mathcal{A}_X$
draws an independent 0-1 random sample with expectation $\pi$, and reports this sample to $\mathcal{A}_Y$. This completes
the description of the modified algorithm $\mathcal{A}_X$.

Suppose $\mathcal{A}_X$ is not $f(t)$-tractable. Then for some problem instance $\mathcal{I}$ on $(Y, d)$, letting $R_X(t)$ be the
expected regret of $\mathcal{A}_X$ on this instance, we have that $\sup_{t \in \mathbb{N}} R_X(t)/f(t) = \infty$. Let $\mu$ be the expected payoff
function in $\mathcal{I}$. Consider the following two problem instances of a MAB problem on $Y$, called $\mathcal{I}_1$ and $\mathcal{I}_2$, in
which if point $y \in Y$ is played at time $t$, the payoff is an independent 0-1 random sample with expectation
$\mu(y)$ and $\mu(x(y, t))$, respectively. Note that algorithm $\mathcal{A}_Y$ is $f(t)$-tractable on $\mathcal{I}_1$, and its behavior on $\mathcal{I}_2$
is identical to that of $\mathcal{A}_X$ on the original problem instance $\mathcal{I}$. It follows that by observing the payoffs of
$\mathcal{A}_Y$ one can tell apart $\mathcal{I}_1$ and $\mathcal{I}_2$ with high probability. Specifically, there is a "classifier" $\mathcal{C}$ which queries
one point in each round, such that for infinitely many times $t$ it tells apart $\mathcal{I}_1$ and $\mathcal{I}_2$ with success probability
$p(t) \to 1$. Now, the latter is information-theoretically impossible.

To see this, let $H_t$ be the $t$-round history of the algorithm (the sequence of points queried, and outputs
received), and consider the distribution of $H_t$ under problem instances $\mathcal{I}_1$ and $\mathcal{I}_2$ (call these distributions
$q_1$ and $q_2$). Let us look at their KL-divergence. By the chain rule (see Lemma B.2), we can
show that KL(qx, 52) < \- (We omit the details.) It follows that letting St be the event that C classifies the 
instance as X\ after round t, we have PgJSt] — IP^I'St] ^ KL(q±, 52) < \- For any large enough time t, 
¥ qi [St] < t, in which case C makes a mistake (on X2) with constant probability. □ 
Lemma A.2. Consider the Lipschitz experts problem with full feedback. If it is $f(t)$-tractable on a metric space
$(X, d)$ then it is $f(t)$-tractable on the completion of $(X, d)$.

Proof. Identical to the easy ("only if") direction of Lemma A.1. □

Remark. Lower bounds only require Lemma A.2 or the easy ("only if") direction of Lemma A.1. For the
upper bounds (algorithmic results), we can either quote the "if" direction of Lemma A.1 or prove the desired
property directly for the specific type of algorithms that we use (which is much easier but less elegant).

Appendix B: KL-divergence techniques 

Our proof will use the notion of Kullback-Leibler divergence (or KL-divergence), defined for two prob- 
ability measures as follows. 

Definition B.1. Let $\Omega$ be a finite set with two probability measures $p, q$. Their Kullback-Leibler divergence,
or KL-divergence, is the sum

    $\mathrm{KL}(p; q) = \sum_{x \in \Omega} p(x) \ln\left(\frac{p(x)}{q(x)}\right)$,

with the convention that $p(x) \ln(p(x)/q(x))$ is interpreted to be $0$ when $p(x) = 0$ and $+\infty$ when $p(x) > 0$
and $q(x) = 0$. If $Y$ is a random variable defined on $\Omega$ and taking values in some set $\Gamma$, the conditional
Kullback-Leibler divergence of $p$ and $q$ given $Y$ is the sum

    $\mathrm{KL}(p; q \mid Y) = \sum_{x \in \Omega} p(x) \ln\left(\frac{p(x \mid Y = Y(x))}{q(x \mid Y = Y(x))}\right)$,

where terms containing $\log(0)$ or $\log(\infty)$ are handled according to the same convention as above.

The definition can be applied to an infinite sample space $\Omega$ provided that $q$ is absolutely continuous with
respect to $p$. For details, see [28], Chapter 2.7. The following lemma summarizes some standard facts about
KL-divergence; for proofs, see [14, 28].

Lemma B.2. Let $p, q$ be two probability measures on a measure space $(\Omega, \mathcal{F})$ and let $Y$ be a random
variable defined on $\Omega$ and taking values in some finite set $\Gamma$. Define a pair of probability measures $p_Y, q_Y$
on $\Gamma$ by specifying that $p_Y(y) = p(Y = y)$, $q_Y(y) = q(Y = y)$ for each $y \in \Gamma$. Then

    $\mathrm{KL}(p; q) = \mathrm{KL}(p; q \mid Y) + \mathrm{KL}(p_Y; q_Y)$,

and $\mathrm{KL}(p; q \mid Y)$ is non-negative.

An easy corollary is the following lemma which expresses the KL-divergence of two distributions on
sequences as a sum of conditional KL-divergences.

Lemma B.3. Let $\Omega$ be a sample space, and suppose $p, q$ are two probability measures on $\Omega^n$, the set of
$n$-tuples of elements of $\Omega$. For a sample point $\omega \in \Omega^n$, let $\omega^i$ denote its first $i$ components. If $p^i, q^i$ denote
the probability measures induced on $\Omega^i$ by $p$ (resp. $q$) then

    $\mathrm{KL}(p; q) = \sum_{i=1}^{n} \mathrm{KL}(p^i; q^i \mid \omega^{i-1})$.

Proof. For $m = 1, 2, \ldots, n$, the formula $\mathrm{KL}(p^m; q^m) = \sum_{i=1}^{m} \mathrm{KL}(p^i; q^i \mid \omega^{i-1})$ follows by induction on
$m$, using Lemma B.2. □
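The decomposition in Lemma B.2, which drives the chain rule of Lemma B.3, can be checked numerically on
a small finite sample space. The Python sketch below assumes the standard form of the conditional
KL-divergence, $\sum_x p(x) \ln\big(p(x \mid Y = Y(x)) / q(x \mid Y = Y(x))\big)$; distributions are represented as
dictionaries, and the helper names are illustrative rather than taken from the paper.

```python
import math


def kl(p, q):
    """KL(p; q) for dicts over a finite set (natural log, with 0*ln(0/q) = 0)."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf
        total += px * math.log(px / qx)
    return total


def marginal(p, Y):
    """Distribution of Y(x) when x is drawn from p."""
    out = {}
    for x, px in p.items():
        out[Y(x)] = out.get(Y(x), 0.0) + px
    return out


def conditional_kl(p, q, Y):
    """KL(p; q | Y), assuming the standard definition stated in the lead-in."""
    pY, qY = marginal(p, Y), marginal(q, Y)
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf
        y = Y(x)
        total += px * math.log((px / pY[y]) / (qx / qY[y]))
    return total
```

Taking $\Omega = \{0,1\}^2$ and $Y$ equal to the first coordinate gives the $n = 2$ case of Lemma B.3: the divergence
of the pair splits into the divergence of the first coordinate plus the conditional divergence of the second.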
The following three lemmas will also be useful in our lower bound argument. Here and henceforth
we will use the following notational convention: for real numbers $a, b \in [0, 1]$, $\mathrm{KL}(a; b)$ denotes the
KL-divergence $\mathrm{KL}(p; q)$ where $p, q$ are probability measures on $\{0, 1\}$ such that $p(\{1\}) = a$, $q(\{1\}) = b$. In
other words,

    $\mathrm{KL}(a; b) = a \ln\left(\frac{a}{b}\right) + (1 - a) \ln\left(\frac{1-a}{1-b}\right)$.

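Lemma B.4. For any $0 < \epsilon < y < 1$, $\mathrm{KL}(y - \epsilon; y) < \epsilon^2 / (y(1 - y))$.

Proof. A calculation using the inequality $\ln(1 + x) < x$ (valid for $x > -1$, $x \ne 0$) yields

    $\mathrm{KL}(y - \epsilon; y) = (y - \epsilon) \ln\left(\frac{y - \epsilon}{y}\right) + (1 - y + \epsilon) \ln\left(\frac{1 - y + \epsilon}{1 - y}\right)$
        $< (y - \epsilon)\left(\frac{y - \epsilon}{y} - 1\right) + (1 - y + \epsilon)\left(\frac{1 - y + \epsilon}{1 - y} - 1\right)$
        $= \frac{-\epsilon(y - \epsilon)}{y} + \frac{\epsilon(1 - y + \epsilon)}{1 - y} = \frac{\epsilon^2}{y(1 - y)}$. □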
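The inequality of Lemma B.4 is easy to probe numerically. The following Python sketch compares
$\mathrm{KL}(y - \epsilon; y)$ with the bound $\epsilon^2/(y(1-y))$; the function names are illustrative, and the check is a
spot test on chosen values rather than a proof.

```python
import math


def kl_bernoulli(a, b):
    """KL(a; b) between Bernoulli(a) and Bernoulli(b), natural log, for 0 < b < 1."""
    total = 0.0
    if a > 0.0:
        total += a * math.log(a / b)
    if a < 1.0:
        total += (1.0 - a) * math.log((1.0 - a) / (1.0 - b))
    return total


def lemma_b4_bound(y, eps):
    """Right-hand side eps^2 / (y * (1 - y)) of Lemma B.4."""
    return eps ** 2 / (y * (1.0 - y))
```

For small $\epsilon$ the divergence behaves like $\epsilon^2 / (2 y (1 - y))$, so the bound in the lemma is loose by
roughly a factor of two in that regime.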

Lemma B.5. Let $\Omega$ be a sample space with two probability measures $p, q$ whose KL-divergence is $\kappa$. For
any event $\mathcal{E}$, the probabilities $p(\mathcal{E})$, $q(\mathcal{E})$ satisfy

    $q(\mathcal{E}) \ge p(\mathcal{E}) \exp\left(-\frac{\kappa + 1/e}{p(\mathcal{E})}\right)$.

A consequence of the lemma, stated in less quantitative terms, is the following: if $\kappa = \mathrm{KL}(p; q)$ is
bounded above and $p(\mathcal{E})$ is bounded away from zero then $q(\mathcal{E})$ is bounded away from zero.

Proof. Let $a = p(\mathcal{E})$, $b = q(\mathcal{E})$, $c = (1 - a)/(1 - b)$. Applying Lemma B.2 with $Y$ as the indicator random
variable of $\mathcal{E}$ we obtain

    $\kappa = \mathrm{KL}(p; q) \ge \mathrm{KL}(p_Y; q_Y) = a \ln\left(\frac{a}{b}\right) + (1 - a) \ln\left(\frac{1-a}{1-b}\right) = a \ln\left(\frac{a}{b}\right) + (1 - b)\, c \ln(c)$.

Now using the inequality $c \ln(c) \ge -1/e$ (valid for all $c > 0$) we obtain

    $\kappa \ge a \ln(a/b) - (1 - b)/e \ge a \ln(a/b) - 1/e$.

The lemma follows by rearranging terms. □
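Rearranging the last display gives $\ln(a/b) \le (\kappa + 1/e)/a$, that is, $b \ge a \exp(-(\kappa + 1/e)/a)$. The
Python sketch below evaluates this lower bound for finite distributions given as lists and events given as
collections of indices. The helper names are hypothetical, and the code assumes every entry of `q` is positive.

```python
import math


def kl_lists(p, q):
    """KL(p; q) for two probability vectors of equal length (natural log, q > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def lemma_b5_lower_bound(p, q, event):
    """Lower bound p(E) * exp(-(KL(p; q) + 1/e) / p(E)) on q(E) from Lemma B.5.

    `event` is an iterable of indices with p(E) > 0.
    """
    p_event = sum(p[i] for i in event)
    kappa = kl_lists(p, q)
    return p_event * math.exp(-(kappa + 1.0 / math.e) / p_event)
```

The bound degrades quickly as $p(\mathcal{E})$ shrinks, which is consistent with the qualitative statement after the
lemma: it is informative only when $p(\mathcal{E})$ is bounded away from zero.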
Lemma B.6. Let $p, q$ be two probability measures, and suppose that for some $\delta \in (0, \frac{1}{2}]$ they satisfy

    $\forall$ events $\mathcal{E}$,  $1 - \delta < \frac{q(\mathcal{E})}{p(\mathcal{E})} < 1 + \delta$.

Then $\mathrm{KL}(p; q) \le \delta^2$.

Proof. We will prove the lemma assuming the sample space is finite. The result for general measure spaces
follows by taking a supremum.

For every $x$ in the sample space $\Omega$, let $r(x) = \frac{q(x)}{p(x)} - 1$ and note that $|r(x)| < \delta$ for all $x$. Now we make
use of the inequality $\ln(1 + x) \ge x - x^2$, valid for $x \ge -\frac{1}{2}$.

    $\mathrm{KL}(p; q) = \sum_x p(x) \ln\left(\frac{p(x)}{q(x)}\right) = \sum_x p(x) \ln\left(\frac{1}{1 + r(x)}\right)$
        $= -\sum_x p(x) \ln(1 + r(x)) \le -\sum_x p(x) \left[ r(x) - (r(x))^2 \right]$
        $\le -\left(\sum_x p(x) r(x)\right) + \delta^2 \left(\sum_x p(x)\right)$
        $= -\left(\sum_x q(x) - p(x)\right) + \delta^2 = \delta^2$. □
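As a numerical illustration of Lemma B.6, the Python sketch below builds a measure $q$ from $p$ by a
multiplicative perturbation $q(x) = p(x)(1 + r(x))$ with $|r(x)| \le \delta$ and $\sum_x p(x) r(x) = 0$, and then
compares $\mathrm{KL}(p; q)$ with $\delta^2$. It assumes the hypothesis is stated for the ratio $q/p$, as in the proof's
definition of $r(x)$; on a finite space the pointwise ratio condition implies the condition for all events.
The function names are illustrative.

```python
import math


def kl_lists(p, q):
    """KL(p; q) for two probability vectors of equal length (natural log, q > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def perturb(p, r_raw, delta):
    """Return q with q(x) = p(x) * (1 + r(x)), where |r(x)| <= delta and sum_x p(x) r(x) = 0.

    r_raw holds arbitrary values in [-1, 1]; they are centered under p and rescaled.
    """
    mean = sum(pi * ri for pi, ri in zip(p, r_raw))
    r = [delta * (ri - mean) / 2.0 for ri in r_raw]   # |ri - mean| <= 2, so |r| <= delta
    return [pi * (1.0 + ri) for pi, ri in zip(p, r)]
```

Centering the perturbation under $p$ keeps $q$ normalized, which corresponds to the step
$\sum_x p(x) r(x) = \sum_x (q(x) - p(x)) = 0$ in the proof.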
Proof of Theorem 3.6. Let $\Omega = [0, 1]^X$. Using Property 1 of an $(\epsilon, \delta, k)$-ensemble combined with Lemma B.6
we find that $\mathrm{KL}(P_i; P_0) \le \delta^2$.

Let $\mathcal{A}$ be an experts algorithm whose random bits are drawn from a sample space $\Gamma$ with probability
measure $\nu$. For any positive integer $s \le \ln(17k)/(2\delta^2)$, let $p_i^s$ denote the measure $\nu \times (P_i)^s$ on the probability
space $\Gamma \times \Omega^s$. By the chain rule for KL-divergence (Lemma B.3), $\mathrm{KL}(p_i^s; p_0^s) \le s\delta^2 \le \ln(17k)/2$. Now let
£f denote the event that A selects a point x G Si at time s. If pf (£/) > \ then Lemma 1531 implies 

mt) > Pi(S!) exp (- Ml7 ^ 1/e ) > \ exp (- ln(fc) + ln(17) - f ) > 1. 

The events {£? | 1 < i < k} are mutually exclusive, so fewer than k/4 of them can satisfy Pq(££) > f • 
Consequently, fewer than k/4 of them can satisfy pf(£f) > \, a property we denote in this proof by saying 
that $s$ is satisfactory for $i$. Now assume $t \le \ln(17k)/(2\delta^2)$. For a uniformly random $i \in \{1, \ldots, k\}$, the
expected number of satisfactory $s \in \{1, \ldots, t\}$ is less than $t/4$, so by Markov's inequality, for at least half
of the $i \in \{1, \ldots, k\}$, the number of satisfactory $s \in \{1, \ldots, t\}$ is less than $t/2$. Property 2 of an
$(\epsilon, \delta, k)$-ensemble guarantees that every unsatisfactory $s$ contributes at least $\epsilon$ to the regret of $\mathcal{A}$ when the problem
instance is $P_i$. Therefore, at least half of the measures $P_i$ have the property that $R_{(\mathcal{A}, P_i)}(t) \ge \epsilon t/2$. □

B.1 Proof of Claim 5.1

Recall that in Section [5] we defined a pair of payoff functions /io,Mi and a ball Bi of radius ri such that 
fio = ^ on X \ Bi, while for x G Bi we have 

| < IMiix) < m{x) < Ho(x) + | < |. 

Thus, by Lemma lB~4l KL(/j,o(x); m{x)) < rf/3 for all x G X, and KL(no(x); ^i{x)) = for x Bi. 

Represent the algorithm's choice and the payoff observed at any given time $t$ by a pair $(x_t, y_t)$. Let
$\Omega = X \times [0, 1]$ denote the set of all such pairs. When a given algorithm $\mathcal{A}$ plays against payoff functions
$\mu_0, \mu_i$, this defines two different probability measures $p_0^t, p_i^t$ on the set $\Omega^t$ of possible $t$-step histories. Let $\omega^t$
denote a sample point in $\Omega^t$. The bounds derived in the previous paragraph imply that for any non-negative
integer $s$,

$KL(p_0^{s+1}; p_i^{s+1} \mid \omega^s) \le \frac{1}{3} r_i^2 \, P_0(x_{s+1} \in B_i)$. (18)
Summing equation (18) for $s = 0, 1, \ldots, t-1$ and applying Lemma B.3 we obtain

$KL(p_0^t; p_i^t) \le \frac{1}{3} r_i^2 \sum_{s=1}^{t} P_0(x_s \in B_i) = \frac{1}{3} r_i^2 \, E_0(N_i(t))$, (19)

where the last equation follows from the definition of $N_i(t)$ as the number of times algorithm $A$ selects a
strategy in $B_i$ during the first $t$ rounds.

The bound stated in Claim 5.1 now follows by applying Lemma B.3 with the event $S$ playing the role of
$\mathcal{E}$, $P_0$ playing the role of $p$, and $P_i$ playing the role of $q$.



Appendix C: Topological equivalences: proof of Lemma 1.9



Let us restate the lemma, for the sake of convenience. Recall that it includes an equivalence result for 
compact metric spaces, and two implications for arbitrary metric spaces: 

Lemma C.1. For any compact metric space $(X, d)$, the following are equivalent: (i) $X$ is a countable set,
(ii) $(X, d)$ is well-orderable, (iii) no subspace of $(X, d)$ is perfect. For an arbitrary metric space we have
(ii) $\iff$ (iii) and (i) $\Rightarrow$ (ii), but not (ii) $\Rightarrow$ (i).
Proof: compact metric spaces. Let us prove the assertions in circular order.

(i) implies (iii). Let us prove the contrapositive: if $(X, d)$ has a perfect subspace $Y$, then $X$ is uncountable.
We have seen that if $(X, d)$ has a perfect subspace $Y$ then it has a ball-tree. Every leaf $\ell$ of the ball-tree
(i.e. infinite path starting from the root) corresponds to a nested sequence of balls. The closures of these
balls have the finite intersection property, hence, by compactness of $(X, d)$, their intersection is non-empty.
Pick an arbitrary point of the intersection and call it $x(\ell)$. Distinct leaves $\ell, \ell'$ correspond to distinct
points $x(\ell), x(\ell')$ because if $(y, r_y)$, $(z, r_z)$ are siblings in the ball-tree which are ancestors of $\ell$ and $\ell'$,
respectively, then the closures of $B(y, r_y)$ and $B(z, r_z)$ are disjoint and they contain $x(\ell)$, $x(\ell')$ respectively.
Thus we have constructed a set of distinct points of $X$, one for each leaf of the ball-tree. There are
uncountably many leaves, so $X$ is uncountable.

(iii) implies (ii). Let $\beta$ be some ordinal of strictly larger cardinality than $X$. Let us define a transfinite
sequence $\{x_\lambda\}_{\lambda < \beta}$ of points in $X$ using transfinite recursion by specifying that $x_0$ is any isolated point
of $X$, and that for any ordinal $\lambda > 0$, $x_\lambda$ is any isolated point of the subspace $(Y_\lambda, d)$, where
$Y_\lambda = X \setminus \{x_\nu : \nu < \lambda\}$, as long as $Y_\lambda$ is nonempty. (Such an isolated point exists since, by our assumption,
the subspace $(Y_\lambda, d)$ is not perfect.) If $Y_\lambda$ is empty, define e.g. $x_\lambda = x_0$. Now, $Y_\lambda$ is empty for some ordinal $\lambda$ because
otherwise we obtain a mapping from $X$ onto an ordinal $\beta$ whose cardinality exceeds the cardinality of $X$.
Let $\beta_0 = \min\{\lambda : Y_\lambda = \emptyset\}$. Then every point in $X$ has been indexed by an ordinal number $\lambda < \beta_0$, and so
we obtain a well-ordering of $X$. By construction, for every $x = x_\lambda$ we can define a radius $r(x) > 0$ such
that $B(x, r(x))$ is disjoint from the set of points $\{x_\nu : \nu > \lambda\}$. Any initial segment $S$ of the well-ordering
is equal to the union of the balls $\{B(x, r(x)) : x \in S\}$, hence is an open set in the metric topology. Thus
we have constructed a topological well-ordering of $X$.

(ii) implies (i). Suppose we have a binary relation $\prec$ which is a topological well-ordering of $(X, d)$. Let
$S(n)$ denote the set of all $x \in X$ such that $B(x, \frac{1}{n})$ is contained in the set $P(x) = \{y : y \preceq x\}$. By the
definition of a topological well-ordering we know that for every $x$, $P(x)$ is an open set, hence $x \in S(n)$ for
sufficiently large $n$. Therefore $X = \bigcup_{n \in \mathbb{N}} S(n)$. Now, the definition of $S(n)$ implies that every two points
of $S(n)$ are separated by a distance of at least $1/n$. (If $x$ and $z$ are distinct points of $S(n)$ and $x \prec z$, then
$B(x, \frac{1}{n})$ is contained in the set $P(x)$ which does not contain $z$, hence $d(x, z) \ge \frac{1}{n}$.) Thus by compactness
of $(X, d)$ the set $S(n)$ is finite. Hence $X$, being a countable union of finite sets, is countable. □
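
To make the sets $S(n)$ concrete, consider the compact countable space $\{0\} \cup \{1/m : m \ge 1\}$ with the usual distance, ordered as $1 \prec 1/2 \prec 1/3 \prec \cdots \prec 0$. This example, its truncation to $m \le N$, and the function names are assumptions chosen for illustration; they do not appear in the paper. The sketch computes $S(n)$ exactly with rational arithmetic and measures the minimum pairwise distance of its points. For $N > n$ the truncation does not change $S(n)$.

```python
from fractions import Fraction
from itertools import combinations


def build_space(N):
    """Points 1, 1/2, ..., 1/N followed by 0, listed in the assumed well-ordering."""
    return [Fraction(1, m) for m in range(1, N + 1)] + [Fraction(0)]


def S(points, n):
    """Points x whose open ball B(x, 1/n) lies inside P(x) = {y : y precedes or equals x}."""
    radius = Fraction(1, n)
    result = []
    for idx, x in enumerate(points):
        later = points[idx + 1:]  # the complement of P(x)
        # The open ball misses every later point iff each is at distance >= 1/n.
        if all(abs(x - y) >= radius for y in later):
            result.append(x)
    return result


def min_separation(pts):
    """Smallest pairwise distance among the given points."""
    return min(abs(a - b) for a, b in combinations(pts, 2))
```

For instance, `S(build_space(100), 12)` can be compared against `min_separation` to see that a finite set with pairwise distances at least $1/12$ is obtained, as the argument above predicts.
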

Proof: arbitrary metric spaces. For implications (i) $\Rightarrow$ (ii) and (iii) $\Rightarrow$ (ii), the proof above does not in fact use
compactness. An example of an uncountable but well-orderable metric space is $(\mathbb{R}, d)$, where $d$ is a uniform
metric. It remains to prove that (ii) $\Rightarrow$ (iii).

Suppose there exists a topological well-ordering $\prec$. For each subset $Y \subset X$ and an element $\lambda \in Y$ let
$Y_\prec(\lambda) = \{y \in Y : y \preceq \lambda\}$ be the corresponding initial segment.

We claim that $\prec$ induces a topological well-ordering on any subset $Y \subset X$. We need to show that for
any $\lambda \in Y$ the initial segment $Y_\prec(\lambda)$ is open in the metric topology of $(Y, d)$. Indeed, fix $y \in Y_\prec(\lambda)$. The
initial segment $X_\prec(\lambda)$ is open by the topological well-ordering property of $X$, so $B_X(y, \epsilon) \subset X_\prec(\lambda)$ for
some $\epsilon > 0$. Since $Y_\prec(\lambda) = X_\prec(\lambda) \cap Y$ and $B_Y(y, \epsilon) = B_X(y, \epsilon) \cap Y$, it follows that $B_Y(y, \epsilon) \subset Y_\prec(\lambda)$.
Claim proved.

Suppose the metric space $(X, d)$ has a perfect subspace $Y \subset X$. Let $\lambda$ be the $\prec$-minimum element
of $Y$. Then $Y_\prec(\lambda) = \{\lambda\}$. However, by the previous claim $\prec$ is a topological well-ordering of $(Y, d)$, so
the initial segment $Y_\prec(\lambda)$ is open in the metric topology of $(Y, d)$. Since $(Y, d)$ is perfect, $Y_\prec(\lambda)$ must be
infinite, a contradiction. This completes the (ii) $\Rightarrow$ (iii) direction. □



"Transfinite recursion" is a theorem in set theory which asserts that in order to define a function $F$ on ordinals, it suffices to
specify, for each ordinal $\lambda$, how to determine $F(\lambda)$ from $F(\nu)$, $\nu < \lambda$.